Originally posted by: BenSkywalker
You may save some of the overhead from using multiple passes or other tricks in SM2.0, but you can't magically do math in half the number of clock cycles. At least not in the general case.
Maybe not in the general case, but unrolling loops and executing every possible branch is certainly a lot more computationally intensive than handling it with the least possible number of steps.
???
That is not at all what I meant. If you have to compute some mathematical function, the number of mathematical operations required to compute it
does not change going from SM2.0 to SM3.0. The only thing that will change is the overhead -- and in many cases, this is only a small fraction of the total number of operations. In fact, as individual shaders get longer, the percentage of time spent on 'overhead' gets shorter.
Is there extra overhead associated with long, complex shaders in SM2.0 compared to SM3.0? Undoubtedly. I'm sure you can write programs that will run twice as fast in SM3.0, because an SM2.0 implementation requires jumping through a dozen hoops that don't exist in SM3.0. But in any shader where the actual time doing "real work" is large compared to the overhead, SM3.0 will make little to no difference.
Consider me skeptical of the improvement that could be gained from doing this. You're talking about very general changes, then just waving your hands and saying it will run "much better".
Take multiple light interactions on a given surface with a shader. Using SM2.0, if you are running into multiple lights you need to recalculate the shader for each light interaction, as opposed to calculating how that particular light interaction is going to impact an already executed shader.
If you have to handle an arbitrary number of lights, then yes, this is a huge PITA to do with SM2.0 (although ATI's SM2.0b extensions have tools for dealing with this sort of situation). If you set a fixed maximum number of lights, and the instruction count per light source is not ridiculous, you can write multiple SM2.0 shaders (e.g. a shader to handle one light, a shader to handle two lights, etc.) to do this without incurring noticeable performance losses.
Is the SM3.0 code simpler? Sure. Can you make SM2.0 do this just as fast as SM3.0? Usually, yes.
Again, while you may reduce the "length" of the shader by using loops/branches, etc., you are not necessarily reducing the number of instructions executed by the GPU to run the shader. By 'complexity' I mean the computational complexity of running the shader, not necessarily the number of instructions needed to express it.
Could you explain how you can use loops, branches and collapse passes and
not reduce the computational complexity? I can't think of a single example where you can do all of the former and not reduce overhead considerably.
Introducing a lot of branch instructions into a shader can increase the total instruction count and/or execution time (especially if branch instructions are significantly more expensive than other instructions). That is why loop unrolling is a standard optimization technique. If you know in advance how many passes you need, and you can fit the whole thing into a single linear shader program, you don't *need* the loop at all. It makes coding easier, but actually slows execution down.
Let's say you have some graphics code that does something (conceptually) along the lines of the following in SM2.0:
switch (num_of_lights)
{
case 1:
    execute ShaderCalcOneLight(light_data);
    break;
case 2:
    execute ShaderCalcTwoLights(light_data);
    break;
case 3:
    execute ShaderCalcThreeLights(light_data);
    break;
case 4:
    execute ShaderCalcFourLights(light_data);
    break;
default:
    /* do stuff to handle lights in multiple passes */
    break;
}
if (num_of_lights > 1)
execute Shader2;
else
execute Shader3;
In SM3.0, you can write a single shader that looks like this:
for (count = 0; count < num_of_lights; count++)
{
/* code to calculate the contribution of one light */
}
if (num_of_lights > 1)
{
/* code that used to be in Shader2 */
}
else
{
/* code that used to be in Shader3 */
}
This is not necessarily a whole lot faster. Essentially all you save is the overhead of having two shader execution calls rather than one. In fact, the first part might actually be slower in SM3.0 for 1-4 repetitions, since the SM2.0 version doesn't have to use any dynamic branch instructions (again, depending on how slow such instructions are on a particular card).
While yes, the 7800GTX can do 24 pixel and 8 vertex shader ops/clock, it still maintains a comparatively large amount of fixed-function rendering capability.
As rasterizers, GPUs are still going to need a large amount of fixed-function hardware, and as always, everything that can remain fixed-function in hardware should: it is orders of magnitude faster than flexible hardware.
Okay, but in your last post you alluded to things such as using shaders to replace (or at least supplement) surface textures. If you want to shift to that sort of model (where you use shaders everywhere), you simply don't
need as much fixed-function hardware, and you would be better off having more transistors devoted to programmable elements. Current PC graphics cards are still maintaining both in roughly equal amounts.
I'm talking about something that has, say, 16 FF pipelines but 64 pixel/vertex shaders (or something along those lines). Clearly, if you want to move away from super-high-res textures and triangle counts towards vastly improved vertex and pixel shaders, you will eventually need hardware that is built with that kind of engine design in mind (or else your transistor counts are just going to explode -- not that they haven't already with NVIDIA's last two generations of hardware). Next-gen console hardware seems to be heading in this direction already.
You do realize the last part nVidia offered without programmable hardware was the TNT2Ultra? It isn't like basic rasterization has been driving transistor counts for the last four years; it has been programmable units, and that trend isn't stopping.
To some extent, yes, but the ratio of transistors devoted to shaders and fixed-function rasterization hasn't shifted all that much. A 16-pipe card with 16 pixel shaders and 6 vertex shaders is fundamentally just a 'bigger' version of an 8-pipe card with 8 pixel shaders and 3 vertex shaders. The 7800GTX changes this somewhat by only having 16 ROPs for its 24 pipelines, but fundamentally the architecture remains the same.
The problem, as I have stated before, is that currently, the cards that have SM3.0 cost more than their non-SM3.0 equivalents (e.g. X800XL versus 6800GT), or cost about the same/slightly less but perform worse overall (e.g. 6600GT versus X800). Perhaps it would have been clearer if I had said that I do not believe SM3.0 should be the only thing you base a graphics card purchase on today.
So which cards are you going to use as an example? The 6800GT is faster than the X800XL and costs more; the X800 is faster than the 6600GT and on a percentage basis carries a larger premium than the 6800GT does over the X800XL.
Um, I thought I was pretty clear.
The 6800GT is (slightly) faster than the X800XL (and slower at a few things, like HL2), and has SM3.0, but is significantly more expensive. In this case, you are paying a premium pretty much just for SM3.0 (and better performance in Doom3, I guess).
The 6600GT is (noticeably) slower than the X800, but only costs slightly less. The X800 is more cost-effective despite not having SM3.0.