Various Wolfenstein II Benchmarks

May 11, 2008
20,260
1,150
126
Yet it can feed a 1080Ti or a Titan fast enough. Or, a null driver for that matter.

You keep making things up and posting them as facts, just to be proven wrong.

I do not know if it is relevant, but I remember something about the device context.
I vaguely remember reading a game programmer blog saying that a program making use of DX11 can create as many threads as it likes, but only one CPU thread at a time can communicate with the GPU.
Synchronization of those threads is needed. That is kind of serializing all threads again.
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
I do not know if it is relevant, but I remember something about the device context.
I vaguely remember reading a game programmer blog saying that a program making use of DX11 can create as many threads as it likes, but only one CPU thread at a time can communicate with the GPU.
Synchronization of those threads is needed. That is kind of serializing all threads again.

Serialization is handled by the driver. If AMD's drivers can't feed their GPUs as fast as Nvidia's drivers can, that's AMD's issue.
 
May 11, 2008
20,260
1,150
126
Serialization is handled by the driver. If AMD's drivers can't feed their GPUs as fast as Nvidia's drivers can, that's AMD's issue.

I am pretty sure you are wrong about that.
And you are diverging from the DX11 subject.
I just looked it up: the game engine does it, or rather DX11 does, by means of the DX11 device context.
And only one thread at a time is allowed to communicate with the GPU through the driver.
The detail is in what the DX11 device context is capable of:
immediate rendering or deferred rendering.
When deferred rendering is used, the command lists are recorded and queued before being issued to the GPU through the driver,
and then sent via the immediate context as one long list.
It is this deferred rendering that allows multithreading, but still only one thread at a time can communicate with the GPU.
That is my understanding of it.

As can be read here:
https://msdn.microsoft.com/en-us/library/windows/desktop/ff476891(v=vs.85).aspx
and here:
https://msdn.microsoft.com/en-us/library/windows/desktop/ff476892(v=vs.85).aspx
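
To make that concrete, here is a minimal sketch (my own illustration, not code taken from the MSDN pages above) of the pattern those pages describe: worker threads record commands on their own deferred contexts, and only the single immediate context ever submits anything to the driver. RecordSceneSlice is a made-up placeholder for whatever draw recording the engine actually does.

Code:
// Sketch: multithreaded command recording in D3D11 via deferred contexts.
// Worker threads only record; the immediate context alone talks to the GPU.
#include <d3d11.h>
#include <wrl/client.h>
#include <thread>
#include <vector>

using Microsoft::WRL::ComPtr;

// Placeholder for the engine's actual draw recording for one slice of the scene.
static void RecordSceneSlice(ID3D11DeviceContext* ctx, int sliceIndex)
{
    // e.g. ctx->IASetVertexBuffers(...), ctx->DrawIndexed(...), ...
    (void)ctx; (void)sliceIndex;
}

void RenderFrame(ID3D11Device* device, ID3D11DeviceContext* immediate, int workerCount)
{
    std::vector<ComPtr<ID3D11CommandList>> lists(workerCount);
    std::vector<std::thread> workers;

    for (int i = 0; i < workerCount; ++i)
    {
        workers.emplace_back([&, i]
        {
            // Each worker records into its own deferred context; nothing
            // reaches the driver or the GPU yet.
            ComPtr<ID3D11DeviceContext> deferred;
            device->CreateDeferredContext(0, &deferred);

            RecordSceneSlice(deferred.Get(), i);

            // Bake the recorded commands into a command list.
            deferred->FinishCommandList(FALSE, &lists[i]);
        });
    }

    for (auto& w : workers)
        w.join();

    // The single point of serialization: only the immediate context
    // actually issues the queued command lists to the GPU, in order.
    for (auto& cl : lists)
        immediate->ExecuteCommandList(cl.Get(), FALSE);
}

The serialization discussed above is exactly that last loop: however many threads recorded work, submission still funnels through the one immediate context.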

The reality of Nvidia having good scheduling in software is that they rely more on their driver, which gives them more flexibility, and that they have a lot more manpower and available man-hours in the software department than AMD has. I am pretty sure that is the case.
AMD designed a hardware solution which, when the software plays along, reduces CPU overhead compared to a software solution.
That is the truth. But the other side of the story is that programs need to be specially written for it. Hence Mantle, and later DX12 and Vulkan, came into play.
I am pretty sure it is all an optimization problem.
Optimize for one or the other.
In the future I am sure game developers, even when writing solely for x64 devices, will have to write part of the engine as a module that makes maximum use of the available hardware from any of the three major GPU manufacturers.
The game engine will have some sort of abstraction, or else three different binaries for the 3D rendering.

edit:
Forgot to note that even Microsoft mentions that deferred rendering causes considerable overhead.
 

TheELF

Diamond Member
Dec 22, 2012
4,026
753
126
edit:
Forgot to note that even Microsoft mentions that deferred rendering causes considerable overhead.
Isn't that the whole point?
Everybody says that Mantle/DX12/Vulkan uses more threads, on the CPU mind you, to do more graphics work, and everybody is not only OK with that but sees it as the second coming.
So for us consumers, as long as it provides better FPS and smoother frametimes, why exactly would we need to care how they achieve it?
Everybody bitches about how games don't use enough CPU (threads), but when Nvidia does something about it suddenly it's a bad thing...
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
Isn't that the whole point?
Everybody says that Mantle/DX12/Vulkan uses more threads, on the CPU mind you, to do more graphics work, and everybody is not only OK with that but sees it as the second coming.
So for us consumers, as long as it provides better FPS and smoother frametimes, why exactly would we need to care how they achieve it?
Everybody bitches about how games don't use enough CPU (threads), but when Nvidia does something about it suddenly it's a bad thing...

It's not just the fact that DX12/Vulkan can use more threads, it's that they can also do it with FAR less overhead than DX11. DX11 multithreading can use many threads as well, but the overhead quickly becomes unmanageable and it ends up using too much of the CPU.

AC Origins is like that, I believe. While it scales fairly well on many cores/threads, the overhead is very high due to using DX11, and this is why even hex-core CPUs come close to being tapped out.
 

dogen1

Senior member
Oct 14, 2014
739
40
91
AMD designed a hardware solution which, when the software plays along, reduces CPU overhead compared to a software solution.

The difference is extremely negligible. Nvidia switched to software instruction scheduling because Kepler and later had fixed instruction latencies. They had no need for dynamic instruction scheduling. AMD doesn't either (on GCN, IIRC, all instructions have a 4-cycle latency)... their instruction scheduling is just as static, it's just baked into the hardware instead.
 
Reactions: Carfax83
May 11, 2008
20,260
1,150
126
The difference is extremely negligible. Nvidia switched to software instruction scheduling because Kepler and later had fixed instruction latencies. They had no need for dynamic instruction scheduling. AMD doesn't either (on GCN, IIRC, all instructions have a 4-cycle latency)... their instruction scheduling is just as static, it's just baked into the hardware instead.

I do not think it has anything to do with the instruction latency.
If I am not wrong, clock for clock both architectures execute at the same speed.
If I am not mistaken, it has more to do with keeping all those ALUs fed and keeping register pressure low enough to be able to execute shaders on them. Not so long ago someone posted a YouTube video about a new game profiler, and in that particular example of how all the resources were scheduled it showed that the GCN GPU had problems executing because of not enough registers. If there are not enough registers, the execution units cannot be scheduled for use, as there are no registers available to store data to or retrieve it from.

Perhaps it is because Nvidia uses warps of 32 and AMD uses waves of 64 that Nvidia can utilize its units better, or at least more easily, taking less effort. That is always the feeling I had.
Maybe AMD will change their architecture with Navi to stop using the 64-thread model and also go to 32 threads (for gaming GPUs).
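
To put a very rough number on the register pressure point, here is a back-of-the-envelope sketch (assuming the commonly quoted GCN figures of 256 VGPRs per SIMD lane and at most 10 resident wavefronts per SIMD; real limits also depend on SGPRs, LDS and allocation granularity): the number of wavefronts a SIMD can keep in flight to hide latency drops quickly as a shader uses more registers.

Code:
// Back-of-the-envelope occupancy estimate for a GCN-style SIMD.
// Assumed numbers: 256 VGPRs per SIMD lane, at most 10 resident waves.
#include <algorithm>
#include <cstdio>

int WavesPerSimd(int vgprsPerThread)
{
    const int vgprBudget = 256; // assumed register file depth per lane
    const int maxWaves   = 10;  // assumed cap on resident wavefronts
    if (vgprsPerThread <= 0) return maxWaves;
    return std::min(maxWaves, vgprBudget / vgprsPerThread);
}

int main()
{
    // A lean shader keeps the SIMD full; a register-hungry one leaves
    // only a couple of waves to hide memory latency with.
    std::printf("24 VGPRs  -> %d waves per SIMD\n", WavesPerSimd(24));
    std::printf("84 VGPRs  -> %d waves per SIMD\n", WavesPerSimd(84));
    std::printf("128 VGPRs -> %d waves per SIMD\n", WavesPerSimd(128));
    return 0;
}

Under those assumptions, squeezing a shader from 128 down to 84 VGPRs already buys a third wave per SIMD, which is one reason drivers bother to replace or reorder shader code at all.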
 
May 11, 2008
20,260
1,150
126
Isn't that the whole point?
Everybody says that Mantle/DX12/Vulkan uses more threads, on the CPU mind you, to do more graphics work, and everybody is not only OK with that but sees it as the second coming.
So for us consumers, as long as it provides better FPS and smoother frametimes, why exactly would we need to care how they achieve it?
Everybody bitches about how games don't use enough CPU (threads), but when Nvidia does something about it suddenly it's a bad thing...

It is you who reads into it that what Nvidia did was bad.
I never found it a stupid idea.
It is just a matter of looking at what kind of hardware the market has and needs, or else creating the market, and Nvidia executes very well in both disciplines.

For me it is interesting from a technical point of view. I am a nerd, it is bliss.

But the whole point was that DX11 allows multithreading but still has to wait for synchronization of the threads when accessing the GPU through a single thread.
Waiting is doing no work. The only way around that is a lot of optimization: when waiting, do other work.
Prevent the waiting as much as possible.
As far as I know, with the newer graphics APIs it is easier to do other work in parallel.
And that can help in making more use of more cores.
Of course, many threads have been started about the simple chicken-and-egg problem:
how can a game developer make use of more cores when there are none?

We all show our dissatisfaction, and slowly we see a rise in CPU core counts.
I would not be surprised if Nvidia comes out with their own hardware scheduling, optimized purely for gaming, in the near future when it shows benefits.

AMD, because it services so many markets that are either power- or computation-constrained, is very creative in finding solutions.
It just makes sense that they try to sell what they have to as many markets as they can for a reasonable price.
What is wrong with that?
Nothing, I say.
 

TheELF

Diamond Member
Dec 22, 2012
4,026
753
126
But the whole point was that DX11 allows multithreading but still has to wait for synchronization of the threads when accessing the GPU through a single thread.
Waiting is doing no work. The only way around that is a lot of optimization: when waiting, do other work.
Prevent the waiting as much as possible.
As far as I know, with the newer graphics APIs it is easier to do other work in parallel.

Why do you think it has to wait for synchronization of the threads when accessing the GPU through a single thread?
Each thread can finish its workload at whatever time it does and add whatever data it came up with to the end of the single thread's workload that talks to the GPU.
The only reason to synchronize the workloads is if you want everything contained in a frame to be displayed at the same time, but more and more games seem to find that overrated and just don't do it...
It started with GTA V I believe, but it could possibly be even older; GTA only displays part of the scene whenever it feels like it, Agents of Mayhem had this problem very intensely, and a lot of games in between did it as well.

(I have no idea whether that is a GPU or a CPU thing, but skipping the syncing does happen in a lot of games.)
 

dogen1

Senior member
Oct 14, 2014
739
40
91
I do not think it has anything to do with the instruction latency.
If I am not wrong, clock for clock both architectures execute at the same speed.
If I am not mistaken, it has more to do with keeping all those ALUs fed and keeping register pressure low enough to be able to execute shaders on them. Not so long ago someone posted a YouTube video about a new game profiler, and in that particular example of how all the resources were scheduled it showed that the GCN GPU had problems executing because of not enough registers. If there are not enough registers, the execution units cannot be scheduled for use, as there are no registers available to store data to or retrieve it from.

Perhaps it is because Nvidia uses warps of 32 and AMD uses waves of 64 that Nvidia can utilize its units better, or at least more easily, taking less effort. That is always the feeling I had.
Maybe AMD will change their architecture with Navi to stop using the 64-thread model and also go to 32 threads (for gaming GPUs).

I'm actually not sure what you're trying to argue here. Can you explain your point a bit more directly? I'm not sure what you're responding to.

I mean... IIRC Nvidia themselves explained the switch to software scheduling for the same reasons I said.
 
May 11, 2008
20,260
1,150
126
Why do you think it has to wait for synchronization of the threads when accessing the GPU through a single thread?
Each thread can finish its workload at whatever time it does and add whatever data it came up with to the end of the single thread's workload that talks to the GPU.
The only reason to synchronize the workloads is if you want everything contained in a frame to be displayed at the same time, but more and more games seem to find that overrated and just don't do it...
It started with GTA V I believe, but it could possibly be even older; GTA only displays part of the scene whenever it feels like it, Agents of Mayhem had this problem very intensely, and a lot of games in between did it as well.

(I have no idea whether that is a GPU or a CPU thing, but skipping the syncing does happen in a lot of games.)

Interesting.
That got me thinking: why are only parts of the screen shown?
Is it because the game starts so many threads in parallel on the CPU but there are not enough CPU cores to finish in time to render a frame, and/or is there a fallback mechanism to just start rendering when the critical timing for a frame update cannot be met?
 
May 11, 2008
20,260
1,150
126
I'm actually not sure what you're trying to argue here. Can you explain your point a bit more directly? I'm not sure what you're responding to.

I mean... IIRC Nvidia themselves explained the switch to software scheduling for the same reasons I said.

I mean, in the specific example above, the game makes use of shaders that are not optimized to make maximum use of the available register resources the SIMD units in the GPU have.
The shaders are converted to instructions that the GPU understands, but they are written in such a way that all the registers get used up. So not all the available ALUs in the SIMD can be used.
At least that is what I got from it.
Maybe other wavefronts can be started, but nothing comes for free and everything takes time, introducing latency.

And I think that kind of fits with the reality that both Nvidia and AMD replace shaders in their drivers.
I am sure this is a way to alleviate register pressure, or to sort instructions in a different way to make maximum use of the ALUs,
or to change program flow, like if statements.

This blog gives a good example of what happens when a warp is used that cannot be fully utilized.
https://blogs.msdn.microsoft.com/nativeconcurrency/2012/03/26/warp-or-wavefront-of-gpu-threads/
It is fascinating.

When programming GPUs we know that we typically schedule many 1000s of threads and we also know that we can further organize them in many tiles of threads.

Aside: These concepts also exist in other programming models, so in HLSL they are called “threads” and “thread groups”. In CUDA they are called “CUDA threads” and “thread blocks”. In OpenCL they are called “work items” and “work groups”. But we’ll stick with the C++ AMP terms of “threads” and “tiles (of threads)”.

From a correctness perspective and from a programming model concepts perspective, that is the end of the story.

The hardware scheduling unit
However, from a performance perspective, it is interesting to know that the hardware has an additional bunching of threads which in NVIDIA hardware is called a “warp”, in AMD hardware is called a “wavefront”, and in other hardware that at the time of writing is not available on the market, it will probably be called something else. If I had it my way, they would be called a “team” of threads, but I lost that battle.

A “warp” (or “wavefront”) is the most basic unit of scheduling of the NVIDIA (or AMD) GPU. Other equivalent definitions include: “is the smallest executable unit of code” OR “processes a single instruction over all of the threads in it at the same time” OR “is the minimum size of the data processed in SIMD fashion”.

A “warp” currently consists of 32 threads on NVIDIA hardware. A “wavefront” currently consists of 64 threads in AMD hardware. Each vendor may decide to change that, since this whole concept is literally an implementation detail, and new hardware vendors may decide to come up with other sizes.

Note that on CPU hardware this concept of most basic level of parallelism is often called a “vector width” (for example when using the SSE instructions on Intel and AMD processors). The vector width is characterized by the total number of bits in it, which you can populate, e.g. with a given number of floats, or a given number of doubles. The upper limits of CPU vector widths is currently lower than GPU hardware.

So without going to any undesirable extremes of tying your implementation to a specific card, or specific family of cards or a hardware vendor’s cards, how can you easily use this information?

Avoid having diverged warps/wavefronts
Note: below every occurrence of the term “warp” can be replaced with the term “wavefront” and the meaning of the text will not change. I am just using the shorter of the two terms .

All the threads in a warp execute the same instruction in lock-step, the only difference being the data that they operate on in that instruction. So if your code does anything that causes the threads in a warp to be unable to execute the same instruction, then some threads in the warp will be diverged during the execution of that instruction. So you’d be leaving some compute power on the table.

Obvious examples of divergence are

  1. When you pick a tile size that is not a multiple of the warp size.
  2. When you have branching where the whole warp is unable to take the same branch. Note that beyond the very obvious if statement, a for loop or the C++ short-circuit evaluation or the tertiary operator can also result in branching.
Consider this simple line of code, in some restrict(amp) code, working on 1-dimensional data, with an index<1> variable named idx, essentially representing the thread ID

Code:
if (idx[0] >= XYZ)
{
   my_array[idx] += A;
}
else
{
   my_array[idx] *= B;
}


If XYZ is equal to or a multiple of the warp size, then no divergence occurs. So while your code has branching, from a warp divergence perspective there is no harm and all threads are kept busy all the time. Good.

If XYZ is smaller than the warp size OR it is larger but not a multiple of the warp size, then divergence will occur. For example, if warp is of an imaginary size 4, and XYZ is 7, consider what happens during the execution of the first two warps.

The first warp of 4 threads (with idx[0] of 0,1,2,3) all evaluate the conditional in one cycle and it is false for all of them, so all threads take the else clause, where in a single cycle they all multiply their respective element in the array by B, which they access with their thread id, i.e. idx. No problem.

The second warp of 4 threads (with idx[0] of 4,5,6,7) all evaluate the conditional and it is false for the first 3 but true for the last one. So at that point the first three threads proceed down the else path and multiply their respective element in the array by B, BUT the fourth thread is idling at that point since it is diverged, so you are essentially wasting it. The next thing that happens is that the if path is taken by the fourth thread so it can add A to its array element, BUT now the other 3 threads are diverged at this point, so you are wasting the opportunity of having them do work during that cycle. Bad.

Now, there are many scenarios where warp divergence is unavoidable, and you have to balance the tradeoffs between this and other considerations. However, it is good to know that if you have the choice, for today’s hardware, starting with a tile size of a multiple of 64 and ensuring that conditionals and loops would not diverge threads at the 64 boundary, is going to result in better thread utilization.

To be clear, there are many other considerations for fully utilizing your hardware, but this is just one fairly easy one to take care of; for example, perhaps you can pre-sort your data, such that close-by threads follow the same code paths...
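
As a trivial, self-contained illustration of that last recommendation (my own sketch with made-up numbers, not part of the quoted blog): pick the tile size as a multiple of the 64-wide wavefront and round the work size up to it, so only the final tile carries masked-off threads.

Code:
// Round a dispatch up to whole tiles whose size is a multiple of the
// wavefront width, so divergence at the 64 boundary stays confined to
// the very last tile.
#include <cstdio>

constexpr int RoundUp(int value, int multiple)
{
    return ((value + multiple - 1) / multiple) * multiple;
}

int main()
{
    const int wavefront = 64;            // assumed hardware width (GCN)
    const int workItems = 1000;          // e.g. 1000 elements to process
    const int tileSize  = 2 * wavefront; // 128 threads per tile
    const int padded    = RoundUp(workItems, tileSize);

    std::printf("dispatch %d threads in %d tiles of %d (last tile masks off %d)\n",
                padded, padded / tileSize, tileSize, padded - workItems);
    return 0;
}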
 
Last edited:

dogen1

Senior member
Oct 14, 2014
739
40
91
I mean, in the specific example above, the game makes use of shaders that are not optimized to make maximum use of the available register resources the SIMD units in the GPU have.
The shaders are converted to instructions that the GPU understands, but they are written in such a way that all the registers get used up. So not all the available ALUs in the SIMD can be used.
At least that is what I got from it.
Maybe other wavefronts can be started, but nothing comes for free and everything takes time, introducing latency.

And I think that kind of fits with the reality that both Nvidia and AMD replace shaders in their drivers.
I am sure this is a way to alleviate register pressure, or to sort instructions in a different way to make maximum use of the ALUs,
or to change program flow, like if statements.

This blog gives a good example of what happens when a warp is used that cannot be fully utilized.
https://blogs.msdn.microsoft.com/nativeconcurrency/2012/03/26/warp-or-wavefront-of-gpu-threads/
It is fascinating.

Yeah that's a completely separate subject. I have no idea why you responded to me.
 

TheELF

Diamond Member
Dec 22, 2012
4,026
753
126
Interesting.
That got me thinking: why are only parts of the screen shown?
Is it because the game starts so many threads in parallel on the CPU but there are not enough CPU cores to finish in time to render a frame, and/or is there a fallback mechanism to just start rendering when the critical timing for a frame update cannot be met?
Look at this. It is about sending part of the workload to an iGPU under the new APUs, but it's still valid; the same thing happens in a lot of games, they just send all the work to the same GPU. Several threads do different work: you could have a thread for walls, one for trees and bushes, one for cars, and so on. Because it's impossible to make a game where every one of these components is used the exact same amount throughout the whole game, one thread or another is going to finish earlier than the others. So either you use synchronization and lose a lot of performance, or you just display everything the moment it gets ready.
Sadly, because the consoles are so weak, option one is not an option...
https://youtu.be/9cvmDjVYSNk?t=227
 
May 11, 2008
20,260
1,150
126
Look at this. It is about sending part of the workload to an iGPU under the new APUs, but it's still valid; the same thing happens in a lot of games, they just send all the work to the same GPU. Several threads do different work: you could have a thread for walls, one for trees and bushes, one for cars, and so on. Because it's impossible to make a game where every one of these components is used the exact same amount throughout the whole game, one thread or another is going to finish earlier than the others. So either you use synchronization and lose a lot of performance, or you just display everything the moment it gets ready.
Sadly, because the consoles are so weak, option one is not an option...
https://youtu.be/9cvmDjVYSNk?t=227

Another option would be to have a more forward-looking engine that not only keeps track of what is needed for the current frame but also looks ahead to the next frame. Looking one frame ahead should be enough. When a thread is finished but the engine knows it is needed again in the next frame, it can already start preparing data for that frame.
That is where the DMA copy queues come in handy. Data can be copied over to GPU memory in preparation for the GPU to process it while the execution units are still working on the data for the current frame. It is difficult but should increase utilization.
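
Purely as a conceptual sketch (no real graphics API involved; PrepareFrame and SubmitFrame are made-up stand-ins for the engine's work), preparing frame N+1 on another thread while frame N is being submitted could look roughly like this:

Code:
// Overlap CPU preparation of the next frame with submission of the
// current one, using two buffered slots; similar in spirit to staging
// uploads on a DMA/copy queue while the GPU renders the current frame.
#include <array>
#include <future>
#include <vector>

struct FrameData { std::vector<float> payload; };

// Stand-in for CPU-side work: culling, building constants, staging uploads.
static FrameData PrepareFrame(int frameIndex)
{
    return FrameData{ std::vector<float>(1024, static_cast<float>(frameIndex)) };
}

// Stand-in for handing finished data to the GPU / command queue.
static void SubmitFrame(const FrameData& data)
{
    (void)data;
}

void RunLoop(int frameCount)
{
    std::array<FrameData, 2> slots;

    // The very first frame has to be prepared up front.
    if (frameCount > 0)
        slots[0] = PrepareFrame(0);

    for (int frame = 0; frame < frameCount; ++frame)
    {
        // Kick off CPU preparation of the *next* frame on another thread...
        std::future<FrameData> next;
        if (frame + 1 < frameCount)
            next = std::async(std::launch::async, PrepareFrame, frame + 1);

        // ...while the current frame's data is submitted/rendered.
        SubmitFrame(slots[frame % 2]);

        // Pick up the prepared data for the next iteration.
        if (next.valid())
            slots[(frame + 1) % 2] = next.get();
    }
}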
 

urvile

Golden Member
Aug 3, 2017
1,575
474
96
I was messing around with GeForce Experience and MSI Afterburner, so... this is my contribution: GTX 1080 Ti, 3440x1440, Mein Leben! graphics settings.

 

TheELF

Diamond Member
Dec 22, 2012
4,026
753
126
Another option would be to have a more forward-looking engine that not only keeps track of what is needed for the current frame but also looks ahead to the next frame. Looking one frame ahead should be enough. When a thread is finished but the engine knows it is needed again in the next frame, it can already start preparing data for that frame.
That is where the DMA copy queues come in handy. Data can be copied over to GPU memory in preparation for the GPU to process it while the execution units are still working on the data for the current frame. It is difficult but should increase utilization.
So how would that change anything?
The game still has to decide when to show stuff on screen: show it when the whole frame is ready, or show every part whenever it is ready. Whether the thread then goes on to start calculating stuff for the next frame is irrelevant.
 
May 11, 2008
20,260
1,150
126
So how would that change anything?
The game still has to decide when to show stuff on screen: show it when the whole frame is ready, or show every part whenever it is ready. Whether the thread then goes on to start calculating stuff for the next frame is irrelevant.

See my post before.
 

PowerK

Member
May 29, 2012
158
7
91
I haven't played or bought the game yet. But reading around several communities, it seems this game is a mess from a technical point of view... based exclusively on Vulkan and optimized for one IHV.
I guess low-level APIs such as Vulkan and DX12 are great when you have a fixed set of hardware (consoles). However, on PCs, where there are lots of different combinations of CPUs, motherboards and GPUs, a high-level API may be better suited.

I immensely enjoyed some of the DX12-exclusive titles on PC, such as Gears of War 4, Forza Horizon 3 and Forza Motorsport 7. But I can't help wondering... perhaps these titles might actually perform and work better on PC if they were based on the high-level API for PC that we're familiar with (DX11).
 
Last edited:

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
I haven't played or bought the game yet. But reading around several communities, it seems this game is a mess from a technical point of view... based exclusively on Vulkan and optimized for one IHV.
I guess low-level APIs such as Vulkan and DX12 are great when you have a fixed set of hardware (consoles). However, on PCs, where there are lots of different combinations of CPUs, motherboards and GPUs, a high-level API may be better suited.

I don't think optimizing for multiple IHVs is an issue for a proficient developer such as Id Tech. The problem was that AMD's contract with Bethesda had Id Tech and Machine Games focus on optimizing for GCN (particularly Vega) during the development phase, with NVidia being essentially left out. In fact, NVidia never even had launch day drivers ready in time for the game's release, which is unusual. I do believe however that the game's performance on NVidia hardware will increase substantially in the coming months as they start to implement deeper optimizations for NVidia, much like what occurred with Doom after the Vulkan renderer was released.

I immensely enjoyed some of the DX12-exclusive titles on PC, such as Gears of War 4, Forza Horizon 3 and Forza Motorsport 7. But I can't help wondering... perhaps these titles might actually perform and work better on PC if they were based on the high-level API for PC that we're familiar with (DX11).

Like I said before, it all comes down to how skillful the DX12/Vulkan implementation is. Since they are both essentially new APIs, and apparently much more complex than DX11, there is a steep learning curve. We've seen this time and time again with several titles gaining massive performance over time due to patches and/or driver updates, the most notable being Forza Horizon 3, which practically doubled performance in some cases.

That said, I would prefer low-level APIs every single time. When you play Wolfenstein 2 and see the level of performance and the kind of effects this game has, it truly is a massive leap over what was possible before with DX11. I mean, 30,000+ draw calls per frame is insane for a first-person shooter that can hit over 200 FPS!
 

Det0x

Golden Member
Sep 11, 2014
1,253
3,952
136
Last edited:

tg2708

Senior member
May 23, 2013
687
20
81
Not sure if it's driver related, but the game locks up when I die and press restart, which is annoying.
 

urvile

Golden Member
Aug 3, 2017
1,575
474
96
Still messing with Afterburner a bit. Also, once I get better cooling for the TR, I am going to overclock that sucker.

 
Reactions: flash-gordon