390 and Fury X scaling well with the Celeron
980 Ti gains something, but the other Nvidia cards!?
http://pclab.pl/art67995-15.html
Lol, even the Fury X loses performance with async on at 1080p!!!
You know what I'm seeing there? Something that isn't worth spending an extra cent of development on.
...
...
Just too much overhype of a feature used in a crappy alpha game; it's like taking Star Citizen as an example of anything.
But I do see your point. AMD went for the sabotage by giving Mantle away for free to Apple, Khronos and Microsoft so that it's iterated upon and forked into Metal, Vulkan and DX12.
390 and Fury X scaling well with the Celeron
980 Ti gains something, but the other Nvidia cards!?
http://pclab.pl/art67995-15.html
Lol, even the Fury X loses performance with async on at 1080p!!!
I guess they don't support async in hardware either...
This has been demonstrated since Mantle. Also, I saw this photo today...
Is this for real? DX12 actually likes logical cores/threads?
So? If you took a few seconds to look into the game, most of the dynamic lights and compute features are on the higher settings, like Crazy & Extreme, and disabled on High. -_-
The lighter the compute workload, the less useful async compute is; that should be simple to understand.
The other nice thing about DX12 that is obvious from the graph below is how nicely it scales with cores. Those FX6xxx/8xxx just love it.
In DX11 games we see the FX-6300 tied with or even losing to the FX-4300 due to the frequency difference. The 50% extra cores give no benefit most of the time. This pattern was broken recently in some games thanks to the new consoles, but it's still a common theme.
DX12 takes multithreaded rendering to 11 (pun intended)! The FX-6300 is 30%+ faster than an FX-4300.
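For anyone wondering why the extra cores actually help: in DX12 each worker thread can record its own command list in parallel and the main thread submits them all at once. A minimal sketch under assumptions (per-thread command lists already created and open; RecordDrawsForPartition is a hypothetical helper):

```cpp
#include <d3d12.h>
#include <thread>
#include <vector>

// Hypothetical helper: records one thread's slice of the frame's draw calls.
void RecordDrawsForPartition(ID3D12GraphicsCommandList* cl, size_t partition);

// Sketch: one open command list per worker thread (allocators reset elsewhere),
// recorded in parallel, then submitted to the queue in a single call.
void SubmitFrame(ID3D12CommandQueue* graphicsQueue,
                 std::vector<ID3D12GraphicsCommandList*>& perThreadList)
{
    std::vector<std::thread> workers;
    std::vector<ID3D12CommandList*> lists(perThreadList.size());

    for (size_t i = 0; i < perThreadList.size(); ++i) {
        workers.emplace_back([&, i] {
            RecordDrawsForPartition(perThreadList[i], i);
            perThreadList[i]->Close();
            lists[i] = perThreadList[i];
        });
    }
    for (auto& w : workers) w.join();

    // Single submission of everything the worker threads recorded.
    graphicsQueue->ExecuteCommandLists(static_cast<UINT>(lists.size()), lists.data());
}
```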
So?
Still, how do you lose performance with async turned on when you have hardware async support? Wasn't that the main argument in this thread to "prove" that Nvidia doesn't support it?
Instead, now it is shown that the Fury X can manage ~75 FPS at 1080p with a lot of shaders idle, and async only makes things slower if you are at full speed already.
Unfortunately it's not that clear since you drop from 70+ to 60+. Once you add enough work to move the bottleneck though, it shows pretty clearly whether the card has async compute or not.
I'm assuming that you really want to know and are not arguing incessantly in defence of Nvidia. Unfortunately it's not that clear since you drop from 70+ to 60+.
If the Fury X stayed at 75 fps with the compute added then yes, it would be clear, but for now the only clear thing is that AMD has more compute throughput than Nvidia.
AMD
Compute engines can be used for multiple different purposes on GCN hardware:
- Long running compute jobs can be offloaded to a compute queue. If a job is known to be possibly wasting a lot of time in stalls, it can be outsourced from busy queues. This comes with the benefit of achieving better shader utilization, as 3D and compute workloads can be interleaved on every level in hardware, from the scheduler down to actual execution on the compute units.
- High priority jobs can be scheduled to a dedicated compute queue. They will go into the next free execution slot on the corresponding ACE. They can not preempt running shaders, but they will skip any queued ones. Make proper use of the priority setting on the compute queue to achieve this behaviour (see the sketch after this list).
- Get around the execution slot limit. When executing compute shaders with tiny grids, down to the minimum of 64 threads per thread group, you would underutilize the GPU using only a single engine. By utilizing all 8 ACE units together with the 3D engine, you can achieve up to 640 active grids on Fiji. This is precisely the upper occupation limit and maximizes utilization, even if each grid only yields a single wavefront. You should still prefer issuing fewer commands with larger grids instead; pushing the hardware to the limits like this can expose other unexpected bottlenecks.
- Create more back pressure. By providing additional jobs on a compute engine, the impact of blocking barriers in other queues can be avoided. Barriers or fences placed on other queues do not cause any interference. GCN is still perfectly happy to accept compute commands in the 3D queue.
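To make the priority-setting point concrete, here is a minimal D3D12 sketch (assuming an already created ID3D12Device; error handling omitted):

```cpp
#include <d3d12.h>
#include <wrl/client.h>

// Sketch: a dedicated high-priority compute queue, so latency-sensitive jobs
// take the next free execution slot ahead of work queued on other queues.
Microsoft::WRL::ComPtr<ID3D12CommandQueue> CreateHighPriorityComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;
    desc.Flags    = D3D12_COMMAND_QUEUE_FLAG_NONE;

    Microsoft::WRL::ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));  // error handling omitted
    return queue;
}
```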
There is no penalty for mixing draw calls and compute commands in the 3D queue. In fact, compute commands have approximately the same performance as draw calls with proxy geometry [10].
Compute commands should still be preferred for any non-geometry related operation for practical reasons, such as utilizing the local shared memory and increasing possible concurrency.
Offloading compute commands to the compute queue is a good chance to increase GPU utilization.
[10] Proxy geometry refers to a technique where you use simple geometry, like a single screen-filling quad, to apply post-processing effects and the like to 2D buffers.
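As an illustration of "compute instead of proxy geometry", a post-processing pass can simply be dispatched on the direct (3D) command list. A rough sketch, assuming the compute PSO, root signature and descriptor table are created elsewhere:

```cpp
#include <d3d12.h>

// Sketch: full-screen post-process as a compute dispatch on the 3D queue's
// command list, instead of drawing a screen-filling quad (proxy geometry).
void DispatchPostProcess(ID3D12GraphicsCommandList* cl,
                         ID3D12PipelineState* postProcessPSO,
                         ID3D12RootSignature* computeRootSig,
                         D3D12_GPU_DESCRIPTOR_HANDLE uavTable,
                         UINT width, UINT height)
{
    cl->SetPipelineState(postProcessPSO);
    cl->SetComputeRootSignature(computeRootSig);
    cl->SetComputeRootDescriptorTable(0, uavTable);
    cl->Dispatch((width + 7) / 8, (height + 7) / 8, 1); // 8x8 thread groups covering the target
}
```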
Nvidia
Due to the possible performance penalties from using compute commands concurrently with draw calls, compute queues should mostly be used to offload and execute compute commands in batch.
There are multiple points to consider when doing this:
- The workload on a single queue should always be sufficient to fully utilize the GPU. There is no parallelism between the 3D and the compute engine, so you should not try to split the workload between regular draw calls and compute commands arbitrarily. Make sure to always properly batch both draw calls and compute commands (see the sketch after this list). Pay close attention not to stall the GPU with solitary compute jobs limited by texture sample rate, memory latency or the like. Other queues can't become active as long as such a command is running.
- Compute commands should not be scheduled on the 3D queue. Doing so will hurt performance measurably. The 3D engine not only enforces sequential execution, but the reconfiguration of the SMM units will impair performance even further. Consider the use of a draw call with proxy geometry instead when batching and offloading is not an option for you. This will still save you a few microseconds compared to interleaving a compute command.
- Make 3D and compute sections long enough. Switching between compute and 3D queues results in a full flush of all pipelines. The GPU should have spent enough time in one mode to justify the penalty for switching. Beware that there is no active preemption; a long running shader in either engine will stall the transition.
- Despite the limitations, the use of compute shaders should still be considered. The reduced overhead and effectively higher level of concurrency compared to classic draw calls with proxy geometry can still yield remarkable performance gains.
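A rough sketch of the batching advice: keep the compute batch on its own queue and have the 3D queue wait on a fence so the two engines run long, non-interleaved sections (queues, command lists and the fence are assumed to exist):

```cpp
#include <d3d12.h>

// Sketch: submit a whole batch of compute work, then let the 3D queue wait
// GPU-side on a fence before its own batch starts, so the sections don't interleave.
void SubmitBatched(ID3D12CommandQueue* computeQueue,
                   ID3D12CommandQueue* graphicsQueue,
                   ID3D12CommandList* computeBatch,
                   ID3D12CommandList* drawBatch,
                   ID3D12Fence* fence, UINT64& fenceValue)
{
    computeQueue->ExecuteCommandLists(1, &computeBatch);
    computeQueue->Signal(fence, ++fenceValue);   // marks the end of the compute batch
    graphicsQueue->Wait(fence, fenceValue);      // GPU-side wait, no CPU stall
    graphicsQueue->ExecuteCommandLists(1, &drawBatch);
}
```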
Additional care is required to cleanly separate the render pipeline into batches.
If async compute with support for high priority jobs and independent scheduling is a hard requirement, consider the use of CUDA for these jobs instead of the DX12 API.
With GK110 and later, CUDA bypasses the graphics command processor and is handled by a dedicated function unit in hardware which runs uncoupled from the regular compute or graphics engine. It even supports multiple asynchronous queues in hardware, as you would expect.
Ask your personal Nvidia engineer for how to share GPU side buffers between DX12 and CUDA.
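For the CUDA route mentioned above, a minimal host-side sketch of a high-priority stream (CUDA runtime API; kernel launch and error handling omitted, names are illustrative):

```cpp
#include <cuda_runtime.h>

// Sketch: a high-priority, non-blocking CUDA stream for latency-sensitive
// compute jobs scheduled independently of the graphics engine.
cudaStream_t MakeHighPriorityStream()
{
    int leastPriority = 0, greatestPriority = 0;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    cudaStream_t stream = nullptr;
    cudaStreamCreateWithPriority(&stream, cudaStreamNonBlocking,
                                 greatestPriority);  // smaller value = higher priority
    // myKernel<<<grid, block, 0, stream>>>(...);    // hypothetical kernel launch
    return stream;
}
```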
GM200 is a better DX12 card than Fiji. GM204 is beating Tonga without problems. AMD has no advantage with DX12.
Looks like AMD needs to catch up with them. D:
HSA? Isn't that the thing nobody cares about?!
I'm sorry for bumping this post and quoting myself, but... sheer compute horsepower on Fiji is much higher than anything Nvidia offers.
That alone will make a gigantic difference. The matter is this: a proper implementation of DX12 will put the R9 390X on par with the GTX 980 Ti, simply because both cards have such similar compute power, and the GTX 980 Ti is not able to execute context switching properly. The other thing is that Fiji is fundamentally flawed from a design point of view, but it is still capable of delivering. It will never, however, achieve its true, full potential.
About HSA: everybody cares about HSA. If they didn't, they would completely ignore it. But they don't (mobile vendors, Apple, AMD, Nvidia, Intel - everybody is doing something about it).
Are you sure?
P.S. I was referring to GTX 980 Ti - R9 390X comparison.
P.S. The R9 380X will not be able to get to GTX 980 levels of performance because it has a much lower core clock (970 MHz vs over 1000; I do not know the nominal numbers for the GTX 980). But the difference in performance will reflect just that margin. That is, however, in the future.
What's wrong is that NV doesn't support asynchronous compute. All they're getting is the low overhead, and it seems they don't really need it in this title. GM200 doesn't scale over GM204 or GM206. The DX12 path is still unoptimized for them.
GM204 is still faster than Tonga:
http://www.guru3d.com/articles_pages/ashes_of_singularity_directx_12_benchmark_ii_review,7.html
Especially under the "Crazy" setting there is no improvement with DX12 over DX11, which shows that there is something very wrong:
http://pclab.pl/art67995-13.html
Looks promising for Radeon GPUs, but this is still in beta and is using last-gen tech.
The true story will be told at the end of the year, when this benchmark may actually mean something and real DX12 GPUs are on the market... not just DX12 'capable' GPUs.
There seems to be a frametime issue with AMD cards: http://www.guru3d.com/articles-pages/ashes-of-singularity-directx-12-benchmark-ii-review,10.html
There's also some kind of mandatory DX12 feature not supported by AMD's current drivers.
PCPer is looking into it at the moment.