Search results

  1. C

    4th Generation Intel Core, Haswell summarized

    The HSA roadmap (the architecture used by AMD's APUs) runs till at least 2014. So yes it's definitely a longer term plan. But it's really not just a hardware problem. They have to try and convince developers to adopt a quirky heterogeneous way of computing to access an integrated GPU that...
  2. C

    4th Generation Intel Core, Haswell summarized

    Not really. An APU is an "Accelerated" Processing Unit, meaning a CPU and GPU on a single die, with the explicit intention of using the GPU to perform generic high throughput workloads instead of the CPU. This is heterogeneous computing. Haswell has a GPU too, but its CPU cores are more...
  3. C

    4th Generation Intel Core, Haswell summarized

    Don't forget vector workloads, and any scalar floating-point workload for that matter too. All of these benefit from having execution ports 0 and 1 available for vector or floating-point operations, while the new port 6 takes over the ALU, shift and branch operations from port 0 (with port 5...
  4. C

    4th Generation Intel Core, Haswell summarized

    It's the same process node, but they've added 33% more execution ports! So is it really that tough to imagine? Also, the IPC gain from AVX2 is... wait for it... nada. AVX2 isn't about Instructions Per Clock, it's all about doing twice the amount of work per instruction. That said...
  5. C

    Wait, is an i5-3450 doing this?

    That doesn't mean a slower CPU would give you equal performance in this game! Think of some application tasks as a relay race. If you have four runners then each of them is active for only 1/4 of the time, but when they do run they run as fast as they can. With slower runners the activity is...
  6. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    Yes but that's because the load ports are still 128-bit each and they have to sync up to handle 256-bit. Hence dealing with unaligned 256-bit data is very problematic. Haswell will make them 256-bit each so vmovups will become faster than two 128-bit loads. And none of this is even relevant...
  7. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    Indeed. In the same spot where contiguous data is extracted from a cache line, just eight times in parallel. It's really not a whole lot of extra circuitry. It's basically just simple unidirectional shifters with byte granularity and a narrow 32-bit output (combining multiple ones for 64-bit...
  8. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    Because that's an averaged out result and the two arithmetic uops can execute on either port 0 or port 5. Also look at the result for the add instruction in your example: it's a single uop which can execute on port 0, 1, or 5, and hence the numbers for each port add up to 1.0. No, that's...
  9. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    I couldn't find any evidence of VMASKMOV loads being unexpectedly slow in any way. And I'm not lazy, I tested it in practice too: it has a 1 cycle reciprocal throughput. Also in the Intel thread you're linking to, engineer Mark Buxton explains why they are "extremely useful". The only thing...
  10. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    AVX2 is not just about floating-point performance. It also doubles the throughput of parallel integer workloads. I didn't mean to "derail" this thread at all. In fact the thread was split into an official AVX2 thread, but the discussion here still ended up being about how Kaveri and HSA will...
  11. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    Why contest my theory when you can't defend your own? And if I was "just a software guy" like you then why am I able to present a perfectly plausible uop breakdown of Haswell's gather support, while you can't? Please don't make such assumptions to try and get personal because you're out of...
  12. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    Could you tell me the uop breakdown that would result in a 2 cycle reciprocal issue throughput?
  13. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    1 cycle. And that's not an estimate but a fact, confirmed both by timing an unrolled loop containing only VMASKMOV instructions, and its uop decomposition by IACA. And it seems pretty obvious what each of those uops does by comparing its functionality against VMOVMSK and VBLEND. Reciprocal...
  14. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    I don't think that's a correct conclusion. VMASKMOV consists of more uops than VMOVAPS, so if you're already occupying the ports for the extra uops then it adds to the critical path. It's just a coincidence that your VMOVAPS can use underutilized ports so you basically got it for free. That...
  15. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    After the detailed analysis I concluded it will likely be capable of sustaining a peak throughput of one gather operation each cycle. The mask register can be initialized using vcmpeq on port 1, it can then be compacted by a vmovmsk on port 0, then port 3 can do the actual gather load, port 2...
  16. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    Alright, I've collected some hard data to get a better picture of how homogeneous and heterogeneous computing will compare... Sandy/Ivy Bridge's emulation of a gather operation currently takes: 8 uops on port 0 (ALU) 6 uops on port 1 (ALU) 4 uops on port 2 (load) 4 uops on port 3...
  17. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    As explained, I was referring to the ratio of loading data and processing it. You typically can't increase the arithmetic workload without also loading more data. Hence the fact that emulating gather with extract/insert leaves some execution ports underutilized shouldn't be regarded as a useful...
  18. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    That would be true if you could freely increase the arithmetic workload. In reality the ratio between loading data and processing it is practically fixed by the algorithm. It looks like Haswell should be capable of sustaining one gather instruction and one FMA each cycle. That way things like...
  19. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    If this is what you're referring to, then I'm counting 7 extract instructions and 7 insert instructions. But yes, using movd for the lower ones is pretty clever and I didn't realize they can execute on any arithmetic port. Still, all 18 instructions will be replaced with a single one on Haswell...
  20. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    No, you appear to be forgetting about the extract/insert instructions. With the conservative Haswell architecture of one gather port and one regular load port, it can ideally sustain one gather and one regular load each cycle, right? Without gather that's 9 regular load operations and you are...
  21. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    But that would be a test where different port counts are being used. Assuming one gather port and one regular load port for Haswell, the second port would be available for more work while in the extract/insert version you're occupying both. Well we can at least assume that Haswell's gather...
  22. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    I was looking at how one gather port improves things over one scalar port. Since it's highly unlikely for Haswell to have two gather ports, we have to ignore the second load port when evaluating the effect of gather in isolation. Also while that means it's an 8x improvement for one load/gather...
  23. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    No, there's definitely sufficient vectorizable code. The problem is the granularity of it. A heterogeneous system incurs a penalty every time you switch from CPU to GPU processing and back. So you need large enough chunks of parallel code to keep the number of penalties low. In the ideal case a...
  24. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    First of all it is actually very rare in practice, especially for throughput oriented workloads where the access patterns are typically quite regular and prefetching can do an excellent job. Secondly there is Hyper-Threading, so the CPU can switch between two threads. And thirdly there is...
  25. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    There is always a lower bandwidth and higher latency for data transferred between cores versus within cores. A CPU core can execute some scalar code and then a few vector instructions and then some scalar code again, without any sort of hitch. Executing the vector instructions on a...
  26. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    Which is why I said "ruling out discrete GPUs". Kaveri appears to represent AMD's best attempt at tackling this issue. But even so, there will still be a bandwidth bottleneck... You see, when a CPU switches from a sequential scalar workload to a parallel vector workload, all or part of the...
  27. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    First and foremost there's the PCIe bottleneck. It can have a devastating effect on performance, pretty much ruling out discrete GPUs for a whole class of heterogeneous computing. Secondly there's the DRAM bandwidth issue. Even though a CPU and integrated GPU share the same bandwidth, GPU...
  28. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    Neither Haswell nor GK110 is out yet. But who needs the numbers when the technology speaks for itself? GPUs achieve a high theoretical throughput by using wide vector units with gather and FMA support. AVX2 offers the exact same features, but integrates them into the CPU cores themselves thus...
  29. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    I've already shown that for each of those features, homogeneous computing is superior.
  30. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    Sure, but with the CPU often being better at it than the GPU. And next year it's going to get way better at it with twice the throughput per core and gather support. GPGPU is obviously in decline. It didn't work out for discrete cards; NVIDIA wisely backed out. Now some are still clinging...
  31. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    No, it's definitely a huge part of the story. Do you honestly consider GK104 a mid-end GPU when the cards cost 499 bucks? GF104 (GTX 460) was launched at 199 and 229 MSRP for the 768 MB and 1024 MB version respectively: NVIDIA’s GeForce GTX 460: The $200 King. And this just in: GK107 sucks at...
  32. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    What more is there to say about Kaveri?
  33. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    You mean me personally? Firstly I'm concerned for AMD's future if they continue to cripple the CPU to include a bigger GPU, and cripple the GPU's graphics in an attempt to make it better at heterogeneous computing. Intel has strong CPU cores, strong homogeneous throughput computing, and a GPU...
  34. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    I didn't say flagship Kepler chip, I said flagship consumer product. The point is that it makes no sense whatsoever to say that NVIDIA pursues "more GPGPU oriented devices" when leaving out the GTX 680 and GTX 690. And there is no indication that the lower models will have any better GPGPU...
  35. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    Absolutely, but the problem is that improving double-precision support inherently sacrifices graphics performance for a given die size or power consumption. So it's no wonder that all APUs to date have no double-precision support. Kaveri isn't likely to change that if they want to hit the...
  36. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    No. You can't apply supercomputing technology to a cell phone. The issue that supercomputers are facing is that the computing power per node is increasing faster than they can exchange data between them. So Echelon is all about low-latency high-bandwidth technology (caches and interconnections)...
  37. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    You don't see why I'm concerned? Let's try this one more time. Where's the breakthrough in technology that can keep heterogeneous computing superior to homogeneous throughput computing? It's swell that the specification has been finalized, but why the secrecy? If HSA is supposed to be an open...
  38. C

    Use iPhone for PS3 bluetooth audio

    Do you happen to know if that's a hardware limitation, an (Apple) API limitation, or it just hasn't been programmed that way yet by anybody? I actually have a bit of Objective-C programming experience, but I've never dealt with Bluetooth... Thanks for any pointers.
  39. C

    Use iPhone for PS3 bluetooth audio

    Good point but I prefer not to clutter my house with more stuff when I already have a Bluetooth audio device. It also seems like a waste of money if I can get it to work with my phone.
  40. C

    AMD summit today; Kaveri cuts out the middle man in Trinity.

    Not even close. Echelon is a research project for "extreme-scale" computing. A single node is specified to have 20 TFLOPS of computing power, and 256 GB of RAM. That's not going to be put into cell phones any time soon. It's clearly aimed only at the supercomputer market. I was talking about the...