Lower AVX clocks

ajp_anton

Junior Member
Jun 13, 2016
3
0
36
Intel's big server CPUs have a feature that lowers the clocks automatically when running AVX instructions. Anandtech's Broadwell review says that it forces the core (or whole CPU back in Haswell) to run at lower speeds for >1ms when a single AVX instruction is detected.

I don't really understand why this happens. The stated reason is that AVX requires more power but turbo boost already takes power consumption into account and should lower the clocks based on that.

What am I missing? What's wrong with turbo boost, and isn't it fundamentally better to fix it instead of guessing how much and for how long to go slower?
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
You're not missing anything. Power consumption is a big deal in the enterprise space.

I don't get what you want them to "fix".
 

itsmydamnation

Diamond Member
Feb 6, 2011
3,020
3,779
136
When you understand that moving data costs way more power than executing on it, then you understand why AVX256 takes so much power. There is no magic way to move twice the amount of data (over SSE/AVX128) without consuming nearly twice the power.
 

ajp_anton

Junior Member
Jun 13, 2016
3
0
36
You're not missing anything. Power consumption is a big deal in the enterprise space.

I don't get what you want them to "fix".
I know power consumption is a big deal. So why not just let turbo boost do its thing, limiting power usage by turboing less? Why introduce some special rule for AVX alone?
 

ajp_anton

Junior Member
Jun 13, 2016
3
0
36
When you understand that moving data costs way more power than executing on it, then you understand why AVX256 takes so much power. There is no magic way to move twice the amount of data (over SSE/AVX128) without consuming nearly twice the power.
I have no problems understanding why AVX(256) uses more power. The question is why introduce a new special rule for AVX when we already have a general instruction-agnostic way to do the same thing?
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
I have no problems understanding why AVX(256) uses more power. The question is why introduce a new special rule for AVX when we already have a general instruction-agnostic way to do the same thing?

When you see nuances like this, it's generally an indication that the "general instruction agnostic" approach is lacking in capability in some way. A couple of things to consider:

1) The difference in power between AVX and non-AVX traces is large.
2) The response time of the power statistics & turbo is relatively slow vs. the response time of switching between AVX and non-AVX workloads.

So if your response time is too slow, you may need to add a special hook like "AVX specific throttling" or increase your guardband (which makes all non-AVX workloads suck more). If #1 weren't the case, then just increasing your guardband would be reasonable.
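The response-time point above can be sketched with a toy simulation: a reactive controller that only samples power once per feedback period keeps missing the start of an instantaneous AVX-like burst, so the chip repeatedly spends whole windows over budget. All numbers here are invented for illustration; nothing models real turbo behavior.

```c
/* Toy model of the response-time argument: a reactive power controller
 * that samples once per SAMPLE_PERIOD cycles reacts too late to an
 * instantaneous jump in demand (an AVX burst starting at cycle 50),
 * leaving windows where the chip runs outside its power budget.
 * All constants are made up for illustration. */
int cycles_over_budget(void)
{
    const int SAMPLE_PERIOD = 100; /* feedback-loop granularity (cycles) */
    const int BUDGET = 10;         /* power budget (arbitrary units)     */
    int clock_high = 1;            /* 1 = full speed, 0 = throttled      */
    int over = 0;

    for (int cycle = 0; cycle < 1000; cycle++) {
        /* a heavy (AVX-like) burst starts instantly at cycle 50 */
        int demand = (cycle >= 50) ? 15 : 5;
        int power = clock_high ? demand : demand / 2;
        if (power > BUDGET)
            over++;
        /* the controller only observes power once per sample period */
        if (cycle % SAMPLE_PERIOD == 0)
            clock_high = (power <= BUDGET);
    }
    return over;
}
```

The slow controller also oscillates: every time it re-enables full speed it spends another whole period over budget, which is exactly the kind of behavior a dedicated fast hook (or a fatter guardband) is meant to prevent.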
 

YBS1

Golden Member
May 14, 2000
1,945
129
106
Anandtech's Broadwell review says that it forces the core (or whole CPU back in Haswell) to run at lower speeds for >1ms when a single AVX instruction is detected.

I'm not sure this is the case with some of the higher-end motherboards, which allow you to override a lot of Intel's built-in protections; otherwise I don't think Asus would have made such a big deal about being careful with AVX stress tests when approaching 4.4GHz.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
When big functional units are turned on, it can cause a big sudden change in current (dI/dt); if this droops the voltage too far, it can make the processor stop working correctly. The regular turbo boost feedback loop is probably too slow to react to this in time. AMD described feedback in the DLL to adaptively lower clock speed: http://www.realworldtech.com/steamroller-clocking/ In this case Intel may be mitigating it by simply reducing the clock speed before the AVX units are turned on, then eventually increasing it again.

The other part of this is more a matter of semantics and marketing. A processor's rated base speed is supposed to be a minimum you can get with all cores running (in a normal environment with proper cooling). By providing that, you set expectations for a minimum throughput that doesn't depend on any dynamic factors. I guess few enough people are really using AVX that they found it worthwhile to provide a different set of base clocks, with and without AVX.
 

Wall Street

Senior member
Mar 28, 2012
691
44
91
I have no problems understanding why AVX(256) uses more power. The question is why introduce a new special rule for AVX when we already have a general instruction-agnostic way to do the same thing?

I suspect that AVX2 uses so much extra power that Intel doesn't want the processor to have to obey the 'base operating frequency' that the turbo boost algorithm otherwise guarantees as a minimum.
 

Schmide

Diamond Member
Mar 7, 2002
5,639
837
126
I would tend to believe that when your instructions work on 512 bits and include 8 or more operands, you can only do operations as fast as your cache lines can load. Pushing above certain ratios probably gives diminishing returns.
 

TheRyuu

Diamond Member
Dec 3, 2005
5,479
14
81
Although not really related to the OP's point, another interesting thing to keep in mind is that, for power-saving reasons, there may be a small warm-up period for AVX instructions where they only run at half (128-bit) throughput (using the lower 128 bits twice for 256-bit instructions), although this was apparently only measurable on Skylake[1].

Agner said:
I observed an interesting phenomenon when executing 256-bit vector instructions on the Skylake. There is a warm-up period of approximately 14 µs before it can execute 256-bit vector instructions at full speed. Apparently, the upper 128-bit half of the execution units and data buses is turned off in order to save power when it is not used. As soon as the processor sees a 256-bit instruction it starts to power up the upper half. It can still execute 256-bit instructions during the warm-up period, but it does so by using the lower 128-bit units twice for every 256-bit vector. The result is that the throughput for 256-bit vectors is 4-5 times slower during this warm-up period. If you know in advance that you will need to use 256-bit instructions soon, then you can start the warm-up process by placing a dummy 256-bit instruction at a strategic place in the code. My measurements showed that the upper half of the units is shut down again after 675 µs of inactivity.

This warm-up phenomenon has reportedly been observed in previous processors as well (see agner.org/optimize/blog/read.php?i=378#378), but I have not observed it before in any of the processors that I have tested. Perhaps some high-end versions of Intel processors have this ability to shut down the upper 128-bit lane in order to save power, while other variants of the same processors have no such feature. This is something that needs further investigation.
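Agner's warm-up trick above (placing a dummy 256-bit instruction ahead of the hot loop) might look roughly like this with intrinsics. This is a sketch under his Skylake observations, not a guarantee; the ~14 µs / 675 µs figures are his measurements, and the `target("avx")` attribute is a GCC/Clang extension that permits AVX code generation without `-mavx`.

```c
/* Sketch of Agner's warm-up trick: issue a dummy 256-bit instruction
 * well before the hot loop so the upper 128-bit lanes are (presumably)
 * powered up by the time full AVX throughput is needed. */
#include <immintrin.h>

static __m256 avx_sink; /* global sink so the dummy op can't be optimized away */

__attribute__((target("avx")))
void avx_warmup(void)
{
    __m256 v = _mm256_set1_ps(1.0f);
    avx_sink = _mm256_add_ps(v, v); /* any 256-bit op touching the upper lane will do */
}
```

Per the quote, you would call this roughly 14 µs before the AVX-heavy section, and re-warm if more than ~675 µs of non-AVX code has run since.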

And for anyone interested, in the paragraph just before those two, Agner notes that on Skylake the annoying penalty for mixing VEX and non-VEX instructions has been eliminated.

Agner said:
Previous Intel processors have different states for code that use the AVX instruction sets allowing 256-bit vectors versus legacy code with 128-bit vectors and no VEX prefixes. The Sandy Bridge, Ivy Bridge, Haswell and Broadwell processors all have these states and a serious penalty of 70 clock cycles for state switching when a piece of code accidentally mixed VEX and non-VEX instructions. This annoying state shift and penalty has been eliminated on the Skylake. Apparently, the implementation of 256-bit registers has become more streamlined.
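On the pre-Skylake cores Agner mentions, the standard way to dodge that state-switch penalty is to end any block of VEX (256-bit) code with VZEROUPPER before legacy SSE code runs. A minimal sketch with intrinsics (the `target("avx")` attribute is a GCC/Clang extension so it compiles without `-mavx`; compilers usually emit VZEROUPPER automatically at function boundaries, so this mainly matters when hand-written intrinsics call into SSE-only libraries):

```c
/* Horizontal sum of 8 floats using 256-bit code, ending with
 * VZEROUPPER so subsequent legacy (non-VEX) SSE code doesn't pay the
 * pre-Skylake state-transition penalty Agner describes. */
#include <immintrin.h>

__attribute__((target("avx")))
float sum8(const float *p)
{
    __m256 v  = _mm256_loadu_ps(p);          /* 256-bit (VEX) load */
    __m128 lo = _mm256_castps256_ps128(v);   /* lower 128 bits     */
    __m128 hi = _mm256_extractf128_ps(v, 1); /* upper 128 bits     */
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);                   /* horizontal sums    */
    s = _mm_hadd_ps(s, s);
    float r = _mm_cvtss_f32(s);
    _mm256_zeroupper(); /* clear upper halves before any legacy SSE code */
    return r;
}
```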

This post might be too technical for here but it's something interesting about AVX on Intel's CPUs.

[1] http://www.agner.org/optimize/blog/read.php?i=415 (last 2 to 3 paragraphs)
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
I would tend to believe that when your instructions work on 512 bits and include 8 or more operands, you can only do operations as fast as your cache lines can load. Pushing above certain ratios probably gives diminishing returns.

We're talking about AVX2 here, so 256 bits. When you say cache-line loads are a bottleneck, do you mean misses to main memory? Most real-world SIMD programs are probably not spending their entire time being main-RAM limited. It really depends on the algorithm, but you can do a lot predominantly in cache.
 

TheRyuu

Diamond Member
Dec 3, 2005
5,479
14
81
We're talking about AVX2 here, so 256 bits. When you say cache-line loads are a bottleneck, do you mean misses to main memory? Most real-world SIMD programs are probably not spending their entire time being main-RAM limited. It really depends on the algorithm, but you can do a lot predominantly in cache.

You're also doubling the number of registers with AVX-512, to 32 registers in 64-bit mode.
 

Schmide

Diamond Member
Mar 7, 2002
5,639
837
126
Well, Haswell+ can theoretically do 32 FP operations per clock. However, this is done with a 16-deep pipeline, a ~60-entry scheduler, and 72 loads / 42 stores in flight. If you were to use only registers, iterating over all 32 of them and assuming 2 registers used per opcode, you could probably keep the 16-stage pipeline full continuously. I would assume this to be a rare case; sooner or later you will have to incur at least a 4-tick cache-line penalty for both loads and stores.
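The 32-per-clock figure above comes out of simple arithmetic, assuming the commonly cited Haswell configuration (two 256-bit FMA ports, single precision, FMA counted as two flops):

```c
/* Back-of-envelope check of Haswell's peak single-precision rate:
 * 2 FMA ports x 8 SP lanes per 256-bit register x 2 flops per FMA
 * (multiply + add) = 32 SP flops per clock. Numbers are the commonly
 * cited Haswell specs, not measured here. */
int haswell_peak_sp_flops_per_clock(void)
{
    const int fma_ports     = 2; /* ports 0 and 1                */
    const int sp_lanes      = 8; /* 256 bits / 32-bit floats     */
    const int flops_per_fma = 2; /* fused multiply + add         */
    return fma_ports * sp_lanes * flops_per_fma;
}
```

Sustaining that peak requires two FMAs issued every cycle, which is exactly why the register-only case is so hard to hit in practice.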
 