AMD rumor

Page 3 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

dmens

Platinum Member
Mar 18, 2005
2,271
917
136
*shrug*

The fact you take the absurd term "reverse hyperthreading" at face value demonstrates how little you know. Why bother explaining actual technical details to dilettantes? Have a nice day.
 

Tsuwamono

Senior member
Mar 17, 2006
592
0
0
The application doesn't decide whether it gets HTed or reverse-HTed; the CPU itself does. It's taking the two cores and blending the data paths together, like ATI does with CrossFire, to force both cores to work on the same thread of the program at once. IMO this article is most likely true. It would make sense for AMD to keep it hidden from the public for so long. It's called a trump card, much like the 7800 GTX was Nvidia's trump card against ATI's release of the X1800. We will just have to see whether it turns out to be true or false, but as I said before, it's possible, and I believe it's reality ATM.
 

F1shF4t

Golden Member
Oct 18, 2005
1,583
1
71
Originally posted by: dmens
*shrug*

The fact you take the absurd term "reverse hyperthreading" at face value demonstrates how little you know. Why bother explaining actual technical details to dilettantes? Have a nice day.

dmens, he was saying that "Reverse Hyperthreading is not taking a Hyperthreaded application and removing the HT-ness from it."

Where did you get the idea that's what he was saying? If you read the rest of the post properly, he has quite a good point. You did the same to me and now you're doing it to SunnyD. Like I said: read, understand, then post.


You are right that there is no such thing as an HT app; in fact, any multithreaded app will take advantage of HT, since it simply sees two processors and does not know whether they are virtual or not. (That's where the problem with the Intel 840 came from with some multithreaded apps.)
 

dmens

Platinum Member
Mar 18, 2005
2,271
917
136
Gimme a break. The only thing in your post that read differently to me was that you were referring to the rumor, not the actual technology, as "it" in your second post. You might want to clarify next time, so your point is clear.

So you read his post? Care to explain what "HT-ness" is?
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
We certainly need some clarification here (I just don't know if I'm good enough to do that...).

Let's start with some simplifications:

1. SMT (HT is a form of SMT) allows for code that is written in parallel (in other words, the people who wrote the code have taken a program and written it so that it can use multiple threads at the same time) to operate more efficiently. Quoting from the Intel article I linked:

Thread-level parallelism (the ability to simultaneously process multiple instruction streams, or threads) can dramatically improve overall performance. Each of these threads can correspond to a different part of a program and runs on one of the multiple hardware contexts available through multi-core and multithreaded designs

It also allows separate program threads to start at the same time (multitasking). It is not as powerful as SMP (multi-core) because while SMT can start multiple threads, it only has the resources to fully process one of them at a time (SMP can fully process as many threads as it has cores). However, by starting multiple threads, it keeps the processor as busy as it can possibly be at all times.
The only drawback comes when there is a conflict with one of the threads it has started: if there's a problem, it has to go back and restart the thread or start a different one. This occurs when the written code isn't optimised properly for HT (and is the reason there are circumstances where HT actually slows things down). That said, this is not a common occurrence.
Therefore, SMP functions best only when the code has been optimised for it. Again, to quote from the article:
Today we rely on the software developer to express parallelism in the application, or we depend on automatic tools (compilers) to extract this parallelism. These methods are only partially successful

2. "Reverse HT" (otherwise known as Speculative Threading) allows the compiler to "guess" the outcome when trying to create parallelism, and store those guesses (or Speculative Threads) in memory. The CPU then compares those STs, uses what it needs, and discards the rest. So rather than taking multiple parallel threads pre-designed for a single process, RHT takes a single thread and splits it into multiple speculative threads. While this would be somewhat useful on dual core, the more cores you add the better it gets. Quad- and 8-core systems would benefit massively from this, even for single-threaded apps (like most games) that haven't been written for it.
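To make the "compute several guesses, keep the right one" idea concrete, here is a toy sketch of the branch-level version (often called eager execution). All of the names and numbers are illustrative; this is not an actual AMD or Intel mechanism, just the shape of the technique:

```python
# Toy sketch of eager execution: when a branch is hard to predict,
# compute BOTH outcomes on separate workers, then keep the result that
# matches the real condition and discard the other. Illustrative only.
from concurrent.futures import ThreadPoolExecutor

def taken_path(x):       # work done if the branch is taken
    return x * 2

def not_taken_path(x):   # work done if the branch falls through
    return x + 100

def eager_branch(condition, x):
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_taken = pool.submit(taken_path, x)      # speculative thread 1
        f_not_taken = pool.submit(not_taken_path, x)  # speculative thread 2
        # Both results now exist; only one becomes architecturally visible.
        return f_taken.result() if condition else f_not_taken.result()

print(eager_branch(True, 21))   # 42
print(eager_branch(False, 21))  # 121
```

Note the cost that dmens raises later in the thread: one of the two workers always did throwaway work.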

3. Most important to remember is the fact that it takes BOTH hardware and software to make this work. While AMD is (IMHO) the best hardware innovator today (or at least tied with Intel), their software division isn't even a pitiful shadow of Intel's by comparison.
So unless AMD has some very hidden plans (like a secret agreement with Pathscale or Sun to write a new compiler), this rumour could at best mean that AMD will be ready with the hardware once Intel has written the software (and not before).
I somewhat disagree with dmens about it not being possible for K8L, but I respect his experience (this is his profession) and will remain on the fence about the possibility. I still think it's possible with what we have seen of K8L, but I'm not betting any money on it.
 

dmens

Platinum Member
Mar 18, 2005
2,271
917
136
Now that Viditor took the time to write up the primer (thanks), consider this: current branch predictors are hitting 90% accuracy rates even on unfriendly loads such as compilers and AI computations. How deep would such a hypothetical machine need to speculate before it starts yielding significant performance benefits? Much more than 4, 8, or 16, which is about as many cores as we'll see in the near future.

Moreover, using a conventional wide-issue OOO core as an execution "chunk" for this machine would be absurd, because the core draws its full power whether it is doing real or throwaway work. Given that such a machine, even with fantastic compiler support, would be doing more throwaway work than Prescott replay, the overall power draw, even at an insufficient depth of 16, would be through the roof, while yielding minimal performance gains.

So obviously the only way this technology is possible is with a large array of tiny cores, which K8L obviously does not have, looking at its floorplan.
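For anyone who wants to check dmens' arithmetic, a quick back-of-envelope script (the 90% accuracy figure is his; the rest is just exponentiation) shows why speculating to a useful depth explodes in core count:

```python
# Back-of-envelope numbers behind the argument above: with a ~90%-accurate
# branch predictor, a predicted path of depth N is entirely correct
# 0.9**N of the time, while eagerly executing every outcome to the same
# depth needs 2**N cores, of which only one holds the right path.
ACCURACY = 0.90

for depth in (4, 8, 16):
    p_correct = ACCURACY ** depth   # chance the predictor alone suffices
    cores_needed = 2 ** depth       # cores for full eager coverage
    wasted = cores_needed - 1       # cores doing throwaway work
    print(f"depth {depth:2}: predictor right {p_correct:.0%} of the time, "
          f"eager execution needs {cores_needed} cores ({wasted} wasted)")
```

Even at depth 16 (far beyond near-term core counts), the predictor is still right about 19% of the time end-to-end, while full eager coverage would need 65,536 cores, almost all of them burning power on discarded work.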
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: dmens
Now that Viditor took the time to write up the primer (thanks), consider this: current branch predictors are hitting 90% accuracy rates even on unfriendly loads such as compilers and AI computations. How deep would such a hypothetical machine need to speculate before it starts yielding significant performance benefits? Much more than 4, 8, or 16, which is about as many cores as we'll see in the near future.

Moreover, using a conventional wide-issue OOO core as an execution "chunk" for this machine would be absurd, because the core draws its full power whether it is doing real or throwaway work. Given that such a machine, even with fantastic compiler support, would be doing more throwaway work than Prescott replay, the overall power draw, even at an insufficient depth of 16, would be through the roof, while yielding minimal performance gains.

So obviously the only way this technology is possible is with a large array of tiny cores, which K8L obviously does not have, looking at its floorplan.

Actually, I was thinking that K8L would be using their new virtualization technology (Pacifica) for this...
 

BrownTown

Diamond Member
Dec 1, 2005
5,314
1
0
Well, on one hand I think there are definitely many places where speculation would yield considerable speedups, but I'm just not sure that current processors have a large enough instruction window or accurate enough algorithms to identify those places efficiently. Also, I think it would be a lot more efficient if the programmer or compiler did the work to extract TLP instead of the CPU.

EDIT: hehe, I just hope these CPUs are better at speculation than the people posting wild-ass theories over at XS
 

dmens

Platinum Member
Mar 18, 2005
2,271
917
136
Virtualization is just another privilege level; it has nothing to do with this speculation stuff.
 

stardrek

Senior member
Jan 25, 2006
264
0
0
Let me preface this by saying that I am extremely drunk right now and it is almost 5am here on the east coast. The ideas discussed here, "Anti-Hyperthreading" as the AMD folks are calling it, or Speculative Precomputation as Intel likes to call it, have been talked about for some time in the supercomputer realm. I have dealt with massive multi-processor systems for some time at the data recovery business I work for and have learned quite a bit about the idea of Speculative Precomputation (for future reference I will call this Speculative Precomputation, or SP, because that is the name Intel has given the idea).

This differs from Simultaneous Multithreading (SMT), demonstrated by Alpha long ago, in that SMT allows multiple threads to be executed simultaneously, as Viditor explained. The idea behind SP is that a single-threaded application is analyzed, and helper threads spun off from it are used to predict future data that will be needed in the cache. The predicted data is then fed to another processor's (or core's, as it may be) cache. The second processor, or core, then attempts to trigger possible cache-miss events early, in an attempt to avoid misses for future executions on any other processor or core. This means there will be fewer misses than if a single processor were doing the prefetching tasks on its own, and it can prevent the eviction of other data that would be in the pipeline behind it. This is great for accesses that would be hard to predict and can be incredibly helpful for the execution of single-threaded apps with lots of complex tasks that would be hard to predict.

Overall this cuts down on the bandwidth tax that processors place on the memory and the system as a whole. Because the prediction tasks are handled by other processors/cores, the processor/core doing all the single-threaded execution doesn't have to worry about dumping and reloading the L1 cache all the time. It takes relatively few misses to ruin most of the data that would be prefetched and loaded into the L1 cache, and given the extremely limited space in the L1 cache, being able to fill all of that space with useful data would be a godsend. The same idea applies to L2 and L3 caches as well, and could be used to really load up a processor to its full potential. This could be truly wonderful for many applications, because a single miss in a prefetch can cause many more misses with later prefetches.

This has been discussed in a paper by Intel and demonstrated in a lab with Itanium processors, which I found out about from an HP rep while talking to him about future SuperDome systems.

The idea isn't terribly new, but it would be great to see it in the consumer market, as opposed to the enterprise market. When I sober up, and find the time, I will track down the paper I was talking about, but I believe I outlined the general idea of the subsystems pretty well. If I wasn't clear on anything, I apologize. If there are any edits that need to be made, just state them and I will fix whatever I missed.
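The helper-thread idea above can be mimicked in a toy model. The "cache" below is just a Python set and every name is made up, but it shows the effect being described: a core running ahead and touching future addresses turns the main thread's misses into hits.

```python
# Toy model of speculative precomputation: a helper runs ahead of the
# main thread and warms the cache with the addresses it expects to be
# needed, so the main thread's later accesses hit. Illustrative only.
cache = set()
MISSES = {"with_helper": 0, "without_helper": 0}

def load(addr, counter):
    if addr not in cache:
        MISSES[counter] += 1   # simulated cache miss
        cache.add(addr)        # line is filled after the miss

def helper_thread(future_addrs):
    # Runs ahead of the main thread, prefetching into the cache.
    for addr in future_addrs:
        cache.add(addr)

addrs = list(range(16))

# Main thread alone: a cold cache, so every access misses.
cache.clear()
for a in addrs:
    load(a, "without_helper")

# Helper prefetches first, then the main thread runs: no misses.
cache.clear()
helper_thread(addrs)
for a in addrs:
    load(a, "with_helper")

print(MISSES)  # {'with_helper': 0, 'without_helper': 16}
```

A real implementation would of course have finite cache capacity and imperfect address prediction; this only shows the best case the post is describing.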
 

harryPOThead

Junior Member
Dec 4, 2005
19
0
0
Quote:
'It seems that all AM2 CPUs were outfitted with a support for Reverse-HyperThreading, an architectural change which enables software to think that it is working on a single-core alone. By combining two cores, the company has been able to produce the six IPC "core" that will go head to head against four IPC "core" from Conroe/Merom/WoodCrest combo.' ........taken from the Inquirer, read with a salt shaker at hand
 

HokieESM

Senior member
Jun 10, 2002
798
0
0
Fox5, I'm with you.

I do a lot of computational work (CSM/CFD/CEM), and this kind of thing has been worked on for years (from the software and system-level hardware end). Don't get me wrong: the technical achievement from AMD/Intel's perspective (prediction/threading at the processor level) would be impressive. But it does rely a LOT on the programmers. And for all of the speculators out there: this kind of thing is HIGHLY application dependent. For the more traditional (custom-written, mind you) programs run on large multi-processor machines, I've seen speedups approaching theoretical (with only 1% overhead), but I've also seen programs (sadly, where my research lies) where the speedup is less than half of theoretical (and decreases with each processor added).

It IS a neat thing--and if the penalty in price (which is usually a function of yield and the number of extra transistors) and heat isn't too great, I hope they come through with it.

This "reverse hyperthreading" seems pointless assuming that devs learn how to properly multithread their code. Of course, PC programmers aren't the most efficient bunch in general.
 

trivik12

Senior member
Jan 26, 2006
321
288
136
If the code is single-threaded, one instruction is waiting for the previous instruction to be processed. How can we then process two instructions in parallel? I am confused about how this technology can help single-threaded applications. If the code itself is multi-threaded, then it's a different story.
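The question above is really about dependency chains. A tiny illustrative example (function names are mine) shows the difference between work that is inherently serial, which no amount of cores can split, and work whose pieces are independent and could in principle be farmed out:

```python
# A true dependency chain cannot be split across cores, because each
# step consumes the previous result. Illustrative example only.
def dependent(n):
    x = 1
    for _ in range(n):
        x = (x * 3 + 1) % 1000   # needs the previous x: inherently serial
    return x

# By contrast, these iterations are independent of each other, so a
# compiler or CPU could in principle compute them on separate cores
# and combine the partial sums at the end.
def independent(n):
    return sum(i * i for i in range(n))

print(dependent(5))    # 364
print(independent(5))  # 0 + 1 + 4 + 9 + 16 = 30
```

Speculative schemes try to find (or guess their way past) exactly these dependences inside nominally single-threaded code.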
 

Fox5

Diamond Member
Jan 31, 2005
5,957
7
81
Originally posted by: trivik12
If the code is single-threaded, one instruction is waiting for the previous instruction to be processed. How can we then process two instructions in parallel? I am confused about how this technology can help single-threaded applications. If the code itself is multi-threaded, then it's a different story.

Well, it could work by multithreading single-threaded code that is capable of being multithreaded, but it doesn't sound like it does that. Rather, it sounds like when it comes to a branch in the code, one processor will compute one side and the other processor the other, and it will keep the correct result of the two. If that's what it does, I could see it giving a 5% to 30% increase in performance, and it may explain why AMD has cut the max cache size to 512KB, yet the new processors have a die size about the same as the old 1MB-cache processors. The larger die could also be attributed to the Pacifica tech and the DDR2 memory controller, but it takes an awful lot of additional logic to equal the size of 512KB of cache.
 

stardrek

Senior member
Jan 25, 2006
264
0
0
Originally posted by: Fox5
Originally posted by: trivik12
If the code is single-threaded, one instruction is waiting for the previous instruction to be processed. How can we then process two instructions in parallel? I am confused about how this technology can help single-threaded applications. If the code itself is multi-threaded, then it's a different story.

Well, it could work by multithreading single-threaded code that is capable of being multithreaded, but it doesn't sound like it does that. Rather, it sounds like when it comes to a branch in the code, one processor will compute one side and the other processor the other, and it will keep the correct result of the two. If that's what it does, I could see it giving a 5% to 30% increase in performance, and it may explain why AMD has cut the max cache size to 512KB, yet the new processors have a die size about the same as the old 1MB-cache processors. The larger die could also be attributed to the Pacifica tech and the DDR2 memory controller, but it takes an awful lot of additional logic to equal the size of 512KB of cache.


This is sort of what I was saying, but I wrote it in a little more detail. I think my post was overlooked because it was the last one on page 3. I explained it as follows:

"Let me preface this by saying that I am extremely drunk right now and it is almost 5am here on the east coast. The ideas discussed here, "Anti-Hyperthreading" as the AMD folks are calling it, or Speculative Precomputation as Intel likes to call it, have been talked about for some time in the supercomputer realm. I have dealt with massive multi-processor systems for some time at the data recovery business I work for and have learned quite a bit about the idea of Speculative Precomputation (for future reference I will call this Speculative Precomputation, or SP, because that is the name Intel has given the idea).

This differs from Simultaneous Multithreading (SMT), demonstrated by Alpha long ago, in that SMT allows multiple threads to be executed simultaneously, as Viditor explained. The idea behind SP is that a single-threaded application is analyzed, and helper threads spun off from it are used to predict future data that will be needed in the cache. The predicted data is then fed to another processor's (or core's, as it may be) cache. The second processor, or core, then attempts to trigger possible cache-miss events early, in an attempt to avoid misses for future executions on any other processor or core. This means there will be fewer misses than if a single processor were doing the prefetching tasks on its own, and it can prevent the eviction of other data that would be in the pipeline behind it. This is great for accesses that would be hard to predict and can be incredibly helpful for the execution of single-threaded apps with lots of complex tasks that would be hard to predict.

Overall this cuts down on the bandwidth tax that processors place on the memory and the system as a whole. Because the prediction tasks are handled by other processors/cores, the processor/core doing all the single-threaded execution doesn't have to worry about dumping and reloading the L1 cache all the time. It takes relatively few misses to ruin most of the data that would be prefetched and loaded into the L1 cache, and given the extremely limited space in the L1 cache, being able to fill all of that space with useful data would be a godsend. The same idea applies to L2 and L3 caches as well, and could be used to really load up a processor to its full potential. This could be truly wonderful for many applications, because a single miss in a prefetch can cause many more misses with later prefetches.

This has been discussed in a paper by Intel and demonstrated in a lab with Itanium processors, which I found out about from an HP rep while talking to him about future SuperDome systems.

The idea isn't terribly new, but it would be great to see it in the consumer market, as opposed to the enterprise market. When I sober up, and find the time, I will track down the paper I was talking about, but I believe I outlined the general idea of the subsystems pretty well. If I wasn't clear on anything, I apologize. If there are any edits that need to be made, just state them and I will fix whatever I missed."
 

Greenman

Lifer
Oct 15, 1999
20,643
5,329
136
Couldn't this work just like a cache does right now? CPU one gets an instruction, CPU two gets the next instruction at the same time, then both results are cached and placed in the correct order?
Just a random thought; I know zero about how this stuff works.
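The round-robin idea above runs into data dependencies: the second CPU can only take "the next instruction" if that instruction doesn't read a result the first CPU hasn't produced yet. A toy register-level check (entirely illustrative, not any real ISA) makes that visible:

```python
# Why naive round-robin dispatch is hard: if instruction 1 reads the
# result of instruction 0, a second core cannot start it until the
# first core finishes. Tiny made-up "program": (op, dest, src, src).
program = [
    ("add", "r1", "r0", "r0"),  # instr 0: r1 = r0 + r0
    ("add", "r2", "r1", "r1"),  # instr 1: reads r1 -> depends on instr 0
    ("add", "r3", "r0", "r0"),  # instr 2: independent of instr 0, so it
]                               #   could run on another core in parallel

def reads_result_of(later, earlier):
    # True if 'later' reads a register that 'earlier' writes.
    return earlier[1] in later[2:]

print(reads_result_of(program[1], program[0]))  # True  -> must serialize
print(reads_result_of(program[2], program[0]))  # False -> can overlap
```

This is essentially the dependence check an out-of-order core already does inside one pipeline; doing it across two separate cores is what makes the rumored scheme so hard.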
 