Search results

4th Generation Intel Core, Haswell summarized

The HSA roadmap (the architecture used by AMD's APUs) runs until at least 2014. So yes, it's definitely a longer-term plan. But it's really not just a hardware problem. They have to try and convince developers to adopt a quirky heterogeneous way of computing to access an integrated GPU that...
- CPUarchitect
- Post #150
- Sep 18, 2012
- Forum: CPUs and Overclocking
4th Generation Intel Core, Haswell summarized

Not really. An APU is an "Accelerated" Processing Unit, meaning a CPU and GPU on a single die, with the explicit intention of using the GPU to perform generic high throughput workloads instead of the CPU. This is heterogeneous computing. Haswell has a GPU too, but its CPU cores are more...
- CPUarchitect
- Post #145
- Sep 18, 2012
- Forum: CPUs and Overclocking
4th Generation Intel Core, Haswell summarized

Don't forget vector workloads, and any scalar floating-point workload for that matter too. All of these benefit from having execution ports 0 and 1 available for vector or floating-point operations, while the new port 6 takes over the ALU, shift and branch operations from port 0 (with port 5...
- CPUarchitect
- Post #144
- Sep 18, 2012
- Forum: CPUs and Overclocking
4th Generation Intel Core, Haswell summarized

It's the same process node, but they've added 33% more execution ports! So is it really that tough to imagine? Also, the IPC gain from AVX2 is... wait for it... nada. AVX2 isn't about Instructions Per Clock, it's all about doing twice the amount of work per instruction. That said...
- CPUarchitect
- Post #136
- Sep 18, 2012
- Forum: CPUs and Overclocking
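The arithmetic behind the two claims in that snippet can be sketched as follows. This is only an illustration, assuming the "33%" refers to going from 6 execution ports (Sandy/Ivy Bridge) to 8 (Haswell), and that "twice the amount of work per instruction" refers to 256-bit versus 128-bit integer vectors. The function names are made up for this example.

```python
from fractions import Fraction


def relative_increase(old, new):
    """Relative growth from old to new, as an exact fraction."""
    return Fraction(new - old, old)


def lanes(vector_bits, element_bits=32):
    """Number of elements of the given width that fit in one vector register."""
    return vector_bits // element_bits


# Assumption: 6 execution ports before Haswell, 8 on Haswell.
port_gain = relative_increase(6, 8)

# Assumption: 128-bit integer vectors before AVX2, 256-bit with AVX2.
work_per_instruction_ratio = Fraction(lanes(256), lanes(128))

print(f"extra execution ports: {float(port_gain):.0%}")
print(f"work per vector instruction: {work_per_instruction_ratio}x")
```

Note that the second number says nothing about instructions per clock: the instruction count stays the same while each instruction processes twice as many elements.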
Wait, is an i5-3450 doing this?

That doesn't mean a slower CPU would give you equal performance in this game! Think of some application tasks as a relay race. If you have four runners then each of them is active for only 1/4 of the time, but when they do run they run as fast as they can. With slower runners the activity is...
- CPUarchitect
- Post #19
- Jul 20, 2012
- Forum: CPUs and Overclocking
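The relay race analogy in that snippet can be put into numbers with a toy model. This is a sketch with invented distances and speeds, not a measurement of any real CPU: each "runner" stands for a core that is busy for only one leg of the race.

```python
def relay(leg_length, speeds):
    """Return (total race time, per-runner share of the total time).

    Each runner covers one leg of leg_length at their own speed,
    one after the other, so the total time is the sum of the leg times.
    """
    leg_times = [leg_length / speed for speed in speeds]
    total = sum(leg_times)
    utilization = [t / total for t in leg_times]
    return total, utilization


# Hypothetical numbers: four runners, 100 m legs.
fast_total, fast_util = relay(100.0, [10.0] * 4)  # fast runners
slow_total, slow_util = relay(100.0, [5.0] * 4)   # runners half as fast

print(fast_total, fast_util)
print(slow_total, slow_util)
```

In both cases every runner is active for a quarter of the time, yet the race with slower runners takes twice as long. Low per-core utilization therefore does not imply that a slower CPU would perform equally well.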
AMD summit today; Kaveri cuts out the middle man in Trinity.

Yes but that's because the load ports are still 128-bit each and they have to sync up to handle 256-bit. Hence dealing with unaligned 256-bit data is very problematic. Haswell will make them 256-bit each so vmovups will become faster than two 128-bit loads. And none of this is even relevant...
- CPUarchitect
- Post #154
- Jul 3, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

Indeed. In the same spot where contiguous data is extracted from a cache line, just eight times in parallel. It's really not a whole lot of extra circuitry. It's basically just simple unidirectional shifters with byte granularity and a narrow 32-bit output (combining multiple ones for 64-bit...
- CPUarchitect
- Post #152
- Jul 3, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

Because that's an averaged out result and the two arithmetic uops can execute on either port 0 or port 5. Also look at the result for the add instruction in your example: it's a single uop which can execute on port 0, 1, or 5, and hence the numbers for each port add up to 1.0. No, that's...
- CPUarchitect
- Post #149
- Jul 2, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

I couldn't find any evidence of VMASKMOV loads being unexpectedly slow in any way. And I'm not lazy, I tested it in practice too: it has a 1 cycle reciprocal throughput. Also in the Intel thread you're linking to, engineer Mark Buxton explains why they are "extremely useful". The only thing...
- CPUarchitect
- Post #136
- Jun 30, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

AVX2 is not just about floating-point performance. It also doubles the throughput of parallel integer workloads. I didn't mean to "derail" this thread at all. In fact the thread was split into an official AVX2 thread, but the discussion here still ended up being about how Kaveri and HSA will...
- CPUarchitect
- Post #133
- Jun 30, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

Why contest my theory when you can't defend your own? And if I was "just a software guy" like you then why am I able to present a perfectly plausible uop breakdown of Haswell's gather support, while you can't? Please don't make such assumptions to try and get personal because you're out of...
- CPUarchitect
- Post #132
- Jun 30, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

Could you tell me the uop breakdown that would result in a 2 cycle reciprocal issue throughput?
- CPUarchitect
- Post #129
- Jun 29, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

1 cycle. And that's not an estimate but a fact, confirmed both by timing an unrolled loop containing only VMASKMOV instructions, and its uop decomposition by IACA. And it seems pretty obvious what each of those uops do by comparing its functionality against VMOVMSK and VBLEND. Reciprocal...
- CPUarchitect
- Post #127
- Jun 28, 2012
- Forum: CPUs and Overclocking
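For readers unfamiliar with VMASKMOV, the functional behaviour of the load form can be sketched in plain Python. This only models the result of the instruction (masked-off lanes read as zero), not its uop decomposition or timing, and the comparison with a full load followed by a blend is an illustration rather than a claim about how the hardware implements it. All names are hypothetical.

```python
def maskmov_load(memory, base, mask):
    """Masked load: lanes whose mask bit is set are read from memory,
    all other lanes are returned as zero and their memory is never touched."""
    return [memory[base + lane] if selected else 0
            for lane, selected in enumerate(mask)]


def load_then_blend(memory, base, mask):
    """Same result via a full-width load followed by a blend with zero.

    Unlike the masked load, this reads every lane from memory, so it is
    only equivalent when all lanes are safe to access.
    """
    loaded = [memory[base + lane] for lane in range(len(mask))]
    zeros = [0] * len(mask)
    return [value if selected else zero
            for value, zero, selected in zip(loaded, zeros, mask)]


# Hypothetical data: 16 elements of "memory" and an 8-lane mask.
memory = list(range(100, 116))
mask = [True, False, True, True, False, False, True, False]

masked = maskmov_load(memory, 4, mask)
blended = load_then_blend(memory, 4, mask)
print(masked)
```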
AMD summit today; Kaveri cuts out the middle man in Trinity.

I don't think that's a correct conclusion. VMASKMOV consists of more uops than VMOVAPS, so if you're already occupying the ports for the extra uops then it adds to the critical path. It's just a coincidence that your VMOVAPS can use underutilized ports so you basically got it for free. That...
- CPUarchitect
- Post #123
- Jun 28, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

After the detailed analysis I concluded it will likely be capable of sustaining a peak throughput of one gather operation each cycle. The mask register can be initialized using vcmpeq on port 1, it can then be compacted by a vmovmsk on port 0, then port 3 can do the actual gather load, port 2...
- CPUarchitect
- Post #120
- Jun 28, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

Alright, I've collected some hard data to get a better picture of how homogeneous and heterogeneous computing will compare... Sandy/Ivy Bridge's emulation of a gather operation currently takes: 8 uops on port 0 (ALU), 6 uops on port 1 (ALU), 4 uops on port 2 (load), 4 uops on port 3...
- CPUarchitect
- Post #116
- Jun 27, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

As explained, I was referring to the ratio of loading data and processing it. You typically can't increase the arithmetic workload without also loading more data. Hence the fact that emulating gather with extract/insert leaves some execution ports underutilized shouldn't be regarded as a useful...
- CPUarchitect
- Post #108
- Jun 25, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

That would be true if you could freely increase the arithmetic workload. In reality the ratio between loading data and processing it is practically fixed by the algorithm. It looks like Haswell should be capable of sustaining one gather instruction and one FMA each cycle. That way things like...
- CPUarchitect
- Post #106
- Jun 25, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

If this is what you're referring to, then I'm counting 7 extract instructions and 7 insert instructions. But yes, using movd for the lower ones is pretty clever and I didn't realize they can execute on any arithmetic port. Still, all 18 instructions will be replaced with a single one on Haswell...
- CPUarchitect
- Post #104
- Jun 25, 2012
- Forum: CPUs and Overclocking
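What the extract/insert sequence discussed in that snippet computes can be sketched in plain Python. This is a model of the semantics only: the per-lane steps stand in for the extract, scalar load and insert instructions, and the exact instruction counts from the thread are not reproduced here. The function names and data are invented for the example.

```python
def gather_emulated(memory, indices):
    """Emulate a gather one lane at a time, as done without hardware support."""
    result = [0] * len(indices)
    for lane in range(len(indices)):
        index = indices[lane]   # stands in for extracting the index from the vector
        value = memory[index]   # one scalar load per lane
        result[lane] = value    # stands in for inserting the value into the result vector
    return result


def gather(memory, indices):
    """The same operation expressed as a single step, as a gather instruction would."""
    return [memory[index] for index in indices]


# Hypothetical lookup table and 8 lane indices (one 256-bit vector of 32-bit elements).
table = [i * i for i in range(16)]
indices = [3, 0, 7, 7, 12, 1, 15, 4]

emulated = gather_emulated(table, indices)
single = gather(table, indices)
print(single)
```

Both versions return the same vector; the difference the thread argues about is how many instructions and execution ports each one occupies.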
AMD summit today; Kaveri cuts out the middle man in Trinity.

No, you appear to be forgetting about the extract/insert instructions. With the conservative Haswell architecture of one gather port and one regular load port, it can ideally sustain one gather and one regular load each cycle, right? Without gather that's 9 regular load operations and you are...
- CPUarchitect
- Post #101
- Jun 24, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

But that would be a test where different port counts are being used. Assuming one gather port and one regular load port for Haswell, the second port would be available for more work while in the extract/insert version you're occupying both. Well we can at least assume that Haswell's gather...
- CPUarchitect
- Post #99
- Jun 24, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

I was looking at how one gather port improves things over one scalar port. Since it's highly unlikely for Haswell to have two gather ports, we have to ignore the second load port when evaluating the effect of gather in isolation. Also while that means it's an 8x improvement for one load/gather...
- CPUarchitect
- Post #96
- Jun 23, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

No, there's definitely sufficient vectorizable code. The problem is the granularity of it. A heterogeneous system incurs a penalty every time you switch from CPU to GPU processing and back. So you need large enough chunks of parallel code to keep the number of penalties low. In the ideal case a...
- CPUarchitect
- Post #95
- Jun 23, 2012
- Forum: CPUs and Overclocking
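The granularity argument in that snippet can be illustrated with a simple cost model. This is a sketch under stated assumptions: a fixed round-trip penalty per CPU-to-GPU switch, constant processing rates, and made-up numbers in arbitrary units. None of the values come from the thread.

```python
def cpu_time(work, cpu_rate):
    """Time to process a chunk of parallel work on the CPU itself."""
    return work / cpu_rate


def offload_time(work, gpu_rate, penalty):
    """Time to process the same chunk on the GPU, including the switch penalty."""
    return penalty + work / gpu_rate


def break_even_work(cpu_rate, gpu_rate, penalty):
    """Chunk size at which both options take equally long.

    Solves work / cpu_rate == penalty + work / gpu_rate for work.
    Only meaningful when gpu_rate > cpu_rate.
    """
    return penalty / (1.0 / cpu_rate - 1.0 / gpu_rate)


# Hypothetical values: the GPU is 4x faster but each switch costs 30 time units.
CPU_RATE, GPU_RATE, PENALTY = 1.0, 4.0, 30.0

threshold = break_even_work(CPU_RATE, GPU_RATE, PENALTY)
print(threshold)
print(cpu_time(20.0, CPU_RATE), offload_time(20.0, GPU_RATE, PENALTY))
print(cpu_time(80.0, CPU_RATE), offload_time(80.0, GPU_RATE, PENALTY))
```

Under these assumptions, chunks smaller than the threshold are faster on the CPU even though the GPU has the higher raw throughput, which is the point about needing large enough chunks of parallel code.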
AMD summit today; Kaveri cuts out the middle man in Trinity.

First of all it is actually very rare in practice, especially for throughput oriented workloads where the access patterns are typically quite regular and prefetching can do an excellent job. Secondly there is Hyper-Threading, so the CPU can switch between two threads. And thirdly there is...
- CPUarchitect
- Post #93
- Jun 23, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

There is always a lower bandwidth and higher latency for data transferred between cores versus within cores. A CPU core can execute some scalar code and then a few vector instructions and then some scalar code again, without any sort of hitch. Executing the vector instructions on a...
- CPUarchitect
- Post #91
- Jun 22, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

Which is why I said "ruling out discrete GPUs". Kaveri appears to represent AMD's best attempt at tackling this issue. But even so, there will still be a bandwidth bottleneck... You see, when a CPU switches from a sequential scalar workload to a parallel vector workload, all or part of the...
- CPUarchitect
- Post #89
- Jun 22, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

First and foremost there's the PCIe bottleneck. It can have a devastating effect on performance, pretty much ruling out discrete GPUs for a whole class of heterogeneous computing. Secondly there's the DRAM bandwidth issue. Even though a CPU and integrated GPU share the same bandwidth, GPU...
- CPUarchitect
- Post #84
- Jun 22, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

Neither Haswell nor GK110 is out yet. But who needs the numbers when the technology speaks for itself? GPUs achieve a high theoretical throughput by using wide vector units with gather and FMA support. AVX2 offers the exact same features, but integrates them into the CPU cores themselves thus...
- CPUarchitect
- Post #76
- Jun 21, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

I've already shown that for each of those features, homogeneous computing is superior.
- CPUarchitect
- Post #69
- Jun 21, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

Sure, but with the CPU often being better at it than the GPU. And next year it's going to get way better at it with twice the throughput per core and gather support. GPGPU is obviously on its decline. It didn't work out for discrete cards; NVIDIA wisely backed out. Now some are still clinging...
- CPUarchitect
- Post #67
- Jun 21, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

No, it's definitely a huge part of the story. Do you honestly consider GK104 a mid-end GPU when the cards cost 499 bucks? GF104 (GTX 460) was launched at 199 and 229 MSRP for the 768 MB and 1024 MB versions respectively: NVIDIA's GeForce GTX 460: The $200 King. And this just in: GK107 sucks at...
- CPUarchitect
- Post #65
- Jun 21, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

What more is there to say about Kaveri?
- CPUarchitect
- Post #62
- Jun 20, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

You mean me personally? Firstly I'm concerned for AMD's future if they continue to cripple the CPU to include a bigger GPU, and cripple the GPU's graphics in an attempt to make it better at heterogeneous computing. Intel has strong CPU cores, strong homogeneous throughput computing, and a GPU...
- CPUarchitect
- Post #56
- Jun 20, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

I didn't say flagship Kepler chip, I said flagship consumer product. The point is that it makes no sense whatsoever to say that NVIDIA pursues "more GPGPU oriented devices" when leaving out the GTX 680 and GTX 690. And there is no indication that the lower models will have any better GPGPU...
- CPUarchitect
- Post #48
- Jun 20, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

Absolutely, but the problem is that improving double-precision support inherently sacrifices graphics performance for a given die size or power consumption. So it's no wonder that all APUs to date have no double-precision support. Kaveri isn't likely to change that if they want to hit the...
- CPUarchitect
- Post #45
- Jun 20, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

No. You can't apply supercomputing technology to a cell phone. The issue that supercomputers are facing is that the computing power per node is increasing faster than they can exchange data between them. So Echelon is all about low-latency high-bandwidth technology (caches and interconnections)...
- CPUarchitect
- Post #42
- Jun 20, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

You don't see why I'm concerned? Let's try this one more time. Where's the breakthrough in technology that can keep heterogeneous computing superior to homogeneous throughput computing? It's swell that the specification has been finalized, but why the secrecy? If HSA is supposed to be an open...
- CPUarchitect
- Post #41
- Jun 20, 2012
- Forum: CPUs and Overclocking
Use iPhone for PS3 bluetooth audio

Do you happen to know if that's a hardware limitation, an (Apple) API limitation, or whether it just hasn't been programmed that way by anybody yet? I actually have a bit of Objective-C programming experience, but I've never dealt with Bluetooth... Thanks for any pointers.
- CPUarchitect
- Post #6
- Jun 20, 2012
- Forum: Console Gaming
Use iPhone for PS3 bluetooth audio

Good point but I prefer not to clutter my house with more stuff when I already have a Bluetooth audio device. It also seems like a waste of money if I can get it to work with my phone.
- CPUarchitect
- Post #5
- Jun 20, 2012
- Forum: Console Gaming
AMD summit today; Kaveri cuts out the middle man in Trinity.

Not even close. Echelon is a research project for "extreme-scale" computing. A single node is specified to have 20 TFLOPS of computing power, and 256 GB of RAM. That's not going to be put into cell phones any time soon. It's clearly aimed only at the supercomputer market. I was talking about the...
- CPUarchitect
- Post #35
- Jun 20, 2012
- Forum: CPUs and Overclocking
