A Video Architecture Idea...

jpprod

Platinum Member
Nov 18, 1999
2,373
0
0
Today, many sources are touting memory bandwidth as the biggest problem designers of next-generation graphics chipsets have to face. There are several techniques to combat this problem; however, all of them have design trade-offs. For example:
- Tiling. Greatly improves fillrate efficiency by eliminating overdraw and unnecessary writes to the frame buffer. However, practical implementations have been complex (breaking the scene into small tiles is not trivial) and have had low peak fillrates, which leads to issues such as sluggishness with multi-pass transparencies. Also, some programs doing tricks with the Z-buffer might have issues with accelerators using a tiling architecture.
- eDRAM. Certainly the "coolest" new technique, which brought Bitboys to the scene: dedicate a large amount of on-chip memory, eDRAM, as the frame and Z-buffers. However, no implementations exist. Perhaps current process technology doesn't allow for creating large enough eDRAM buffers. Also, at the highest resolutions this kind of solution has to partially fall back on traditional memory to extend the frame buffer.
- Various tricks (Z-culling, guardband clipping) to reduce unnecessary color/Z reads and writes by hidden surface removal; see the sketch after this list. Certainly a good idea with very few drawbacks, but not as efficient as the aforementioned tiling in eliminating overdraw. Also, issues might arise if the Z-value range is fully utilized or insufficient (speculation warning: could this explain Radeon's weird issues in some titles with 16bit color?)
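
To make the third technique concrete, here's a minimal sketch of one way such coarse Z-culling could work: the screen split into 8x8-pixel blocks, each remembering the farthest Z written into it. The block size, the 0-near/1-far float convention and every name here are just my illustration, not how any shipping chip actually does it:

#include <stdio.h>

#define BLOCKS_X 80   /* 640/8: screen split into 8x8-pixel blocks */
#define BLOCKS_Y 60   /* 480/8 */

/* Farthest Z written into each block so far (0 = near, 1 = far). */
static float block_max_z[BLOCKS_Y][BLOCKS_X];

static void blocks_clear(void)
{
    for (int y = 0; y < BLOCKS_Y; y++)
        for (int x = 0; x < BLOCKS_X; x++)
            block_max_z[y][x] = 1.0f;  /* far plane: nothing drawn yet */
}

/* Conservative rejection: a poly whose nearest point is still behind
 * everything drawn in the block cannot contribute a visible pixel,
 * so its color/Z reads and writes can be skipped entirely. */
static int block_rejects_poly(int bx, int by, float poly_min_z)
{
    return poly_min_z > block_max_z[by][bx];
}

/* When a poly fully covers a block, the record can be tightened:
 * no stored pixel is farther than the poly's farthest Z anymore. */
static void block_update_full_cover(int bx, int by, float poly_max_z)
{
    if (poly_max_z < block_max_z[by][bx])
        block_max_z[by][bx] = poly_max_z;
}

int main(void)
{
    blocks_clear();
    block_update_full_cover(10, 10, 0.3f);  /* a wall drawn at z=0.3 */
    printf("poly at z=0.5 culled: %d\n", block_rejects_poly(10, 10, 0.5f)); /* 1 */
    printf("poly at z=0.1 culled: %d\n", block_rejects_poly(10, 10, 0.1f)); /* 0 */
    return 0;
}

The appeal is one compare per block instead of per-pixel Z reads; the catch, as noted above, is that it only works when the Z range behaves.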

But what if we took the best of these three and put it all in one chip, this way:
- Chip has a memory subsystem consisting of a sizeable eDRAM buffer, say 1-2MB, plus fast (200MHz-range) DDR SDRAM memory
- Chip breaks the scene into very large tiles (256x256 or even bigger), as large as the eDRAM buffer can host. The area of the tile is rendered, and the result written into the frame buffer residing in local DDR memory.
- Chip supports some crude form of HSR

Tiling overhead would be low, since there are only very few tiles. The rendering engine could be "traditional", something along the lines of a GeForce2 or Radeon with crude HSR support, and would thus have the high peak fillrate normally not associated with tiling renderers. Compared to a full eDRAM frame buffer such as Bitboys' Glaze3D, this chip would be easier to manufacture, since the eDRAM doesn't have to be large enough to house the entire frame buffer. A rough sketch of the per-frame flow is below.
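
Here's that per-frame flow as a sketch, assuming 1024x768 with 32bit color + 32bit Z, so one 256x256 tile is 256*256*8 bytes = 512KB and fits the eDRAM with room to spare. The rasterizer is only a stub; everything here is illustrative:

#include <stdint.h>
#include <string.h>

#define TILE_W   256
#define TILE_H   256
#define SCREEN_W 1024
#define SCREEN_H 768

typedef struct { uint32_t color; uint32_t z; } Pixel;

static Pixel edram_tile[TILE_H][TILE_W];       /* the on-chip buffer */
static Pixel framebuffer[SCREEN_H][SCREEN_W];  /* lives in local DDR */

/* Stub standing in for the "traditional" high-fillrate rasterizer
 * working entirely inside the eDRAM; here it just clears the tile. */
static void rasterize_scene_into_tile(int ox, int oy)
{
    (void)ox; (void)oy;  /* a real one would cull geometry to the tile */
    for (int y = 0; y < TILE_H; y++)
        for (int x = 0; x < TILE_W; x++)
            edram_tile[y][x] = (Pixel){ 0, 0xFFFFFFFFu };
}

void render_frame(void)
{
    /* Only (1024/256) * (768/256) = 4 * 3 = 12 tiles per frame, so
     * per-tile overhead stays low, unlike small-tile renderers. */
    for (int oy = 0; oy < SCREEN_H; oy += TILE_H)
        for (int ox = 0; ox < SCREEN_W; ox += TILE_W) {
            rasterize_scene_into_tile(ox, oy);
            for (int y = 0; y < TILE_H; y++)   /* eDRAM -> DDR copy */
                memcpy(&framebuffer[oy + y][ox], edram_tile[y],
                       sizeof edram_tile[y]);
        }
}

int main(void) { render_frame(); return 0; }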

This kind of three-way compromise sounds like a winner to me, please tell me what you think. Is there a flaw in my reasoning?

Of course "the" chip would also have pixel pipelines capable of taking and Gaussian-weighted-blending eight subsamples (with programmable positions) in a single pass, but that's another story and not related to memory bandwidth compromises
 

pen^2

Banned
Apr 1, 2000
2,845
0
0
damn jpprod, why don't you work for one of them companies! i wish i knew 1/10 as much about video card tech as you do
 

fodd3r

Member
Sep 15, 2000
79
0
0
well supposedly the verite 4400, the never-released rendition chip, had 12 megs of embedded dram. the bitboys are talking about 8 megs.

personally i think having a memory hierarchy + efficient rendering would be better. let me explain.

-you have a gpu, on which you have some cache, usually not too big. i'm not sure how much vid cards have so let's assume 256kb.

now adding lots of cache is just plain expensive. so we take one step down and move to something less expensive.

-so we put in some sram. let's say we can put on 1 meg of embedded sram, that should be reasonable. 256bit bus at double core speed due to ddr.

now since sram is expensive and generates quite a bit of heat, we can't have too much.

-so we throw down some edram. say 6 megs. 256bit bus at double core speed due to ddr.

now, the die is getting pretty full! so we have to go off die.

-now we throw down some 200mhz ddr ram with a 128bit bus, enough to bring the total amount of memory to 32 megs or more.

with some scheduling, good organization and compression you should be able to make use of this bandwidth, thus beating out much of the bottleneck.

also, since you have so much low latency, high bandwidth memory, you can run a sorting/hsr engine. this should be able to get rid of many of the hidden surfaces. you could also use matrox's framebuffer tiling method, where the frame buffer is constructed in tiles (a scheduling trick) to increase efficiency. this way you gain the overdraw reduction and keep compatibility.
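
here's a toy model of the fall-through lookup i mean. the sizes are the ones from above; the latencies are completely made-up numbers, just for illustration:

#include <stdio.h>

typedef struct {
    const char *name;
    unsigned    size_kb;
    unsigned    latency;   /* hypothetical cycles, not measurements */
} Level;

static const Level hierarchy[] = {
    { "on-chip cache", 256,       2  },
    { "embedded sram", 1 * 1024,  4  },
    { "embedded dram", 6 * 1024,  10 },
    { "board ddr",     32 * 1024, 40 },
};

/* walk the hierarchy; pretend a level "hits" if the address fits in
 * it (a stand-in for a real tag lookup). returns total cycles spent. */
unsigned fetch_cost(unsigned addr_kb)
{
    unsigned cycles = 0;
    for (int i = 0; i < 4; i++) {
        cycles += hierarchy[i].latency;
        if (addr_kb < hierarchy[i].size_kb)
            return cycles;          /* found it at this level */
    }
    return cycles;                  /* served by board memory */
}

int main(void)
{
    printf("hot texel:  %u cycles\n", fetch_cost(100));    /* in cache */
    printf("cold texel: %u cycles\n", fetch_cost(20000));  /* in ddr   */
    return 0;
}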
 

BenSkywalker

Diamond Member
Oct 9, 1999
9,140
67
91
Hmmm, well you can drop HSR support; it frees up die space and would be somewhat redundant if you are using tile-based rendering anyway.

Die space is what I see as the biggest concern here(see below for why).

For 256x256 tiles I would say you would need at least 2MB eDRAM (anyone feel free to correct my math), but then what do you do about the frame buffer? Also, as you increase geometric complexity your cache would need to grow, or you'd have to render smaller and smaller tiles. Perhaps if you could handle some sort of HSR for geometry beforehand then you could keep it to a reasonable size (any ideas on how to?).

For a reasonable X-Box level tiler (I assume you know what I mean) I would think that you would want at least 4MB eDRAM, 8MB being a better option.

Another problem, how do you integrate the T&L unit with the cache to handle tiling? If you are clearing the cache after each tile you may end up having to recalculate geometry repeatedly. If you don't clear the cache then you will need it to be even larger.

Tilers have always had problems with large quantities of polys/vertices (which is why Gigapixel's tech is supposed to be so revolutionary). Given a small enough fab process, I think you could manage it all. With current process technology I'm not so sure.

Not trying to shoot you down, Jukka - interested in hearing how this could work
 

vss1980

Platinum Member
Feb 29, 2000
2,944
0
76
Reason this hasn't been done before and won't be done for a while.... patents.
 

jpprod

Platinum Member
Nov 18, 1999
2,373
0
0
I see your point about the need for a geometry buffer, Ben. OK, how about the following modifications to the idea... Rendering on this chip would happen this way:
1. Traditional T&L and triangle setup are done into a buffer residing within local DDR SDRAM memory. It's huge in size (64MB is becoming standard), and certainly capable of holding the entire scene's geometry.
2. From the buffer, the tiler gets vertex data and breaks the screen into tiles as large as the available eDRAM allows (be that 1 or 2MB). Each tile is rendered into the eDRAM, and the result written back into the frame buffer (RGB + Z) within local DDR SDRAM memory.

For 500K polygons per scene, assuming some use of strips & fans (worst case scenario IMO would be something along the lines of two vertices per polygon) and running at 60FPS (that's 30Mpolys/s BTW), the bandwidth requirement would be 720MB/s, and as the data is both written and read, 1.44GB/s. 200MHz 128bit DDR SDRAM could certainly handle this along with texel fetches and frame buffer writes (tiles -> frame buffer).
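
Spelling the arithmetic out - note the 12-byte vertex (x, y, z as 32bit floats) is my own assumption; it's simply what makes the figures land where they do:

#include <stdio.h>

int main(void)
{
    double polys_per_scene = 500e3;
    double fps             = 60.0;
    double verts_per_poly  = 2.0;   /* strips/fans worst case */
    double bytes_per_vert  = 12.0;  /* assumed: 3 x 32bit float */

    double polys_per_sec = polys_per_scene * fps;             /* 30M  */
    double write_bw = polys_per_sec * verts_per_poly * bytes_per_vert;
    double total_bw = 2.0 * write_bw;  /* buffer written, then re-read */

    printf("geometry written: %.0f MB/s\n", write_bw / 1e6);  /* 720  */
    printf("written + read:   %.2f GB/s\n", total_bw / 1e9);  /* 1.44 */

    /* for scale: 200MHz x 128bit (16 bytes) x 2 transfers (DDR) */
    printf("DDR peak:         %.1f GB/s\n", 200e6 * 16 * 2 / 1e9); /* 6.4 */
    return 0;
}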

BTW, why would HSR be useless in the case of very large tiles? It would prevent overdraw within a tile, and thus boost real-world fillrate. If it adds significantly to die area, dropping it would of course be understandable, as this industry is all about making the right compromises
 

jpprod

Platinum Member
Nov 18, 1999
2,373
0
0
BrotherMan, I'm pretty certain you're overestimating my knowledge of video chips - to tell the truth, I have very little idea how practical the chip-level implementations of the aforementioned techniques are. Unfortunately for the lot of you, that doesn't keep me from posting
 

BenSkywalker

Diamond Member
Oct 9, 1999
9,140
67
91
"BTW, why would HSR be useless in case of very large tiles?"

You would double the bandwidth needs of tiling AFAICT. You would need to buffer all of the data once to calculate HSR, write back, buffer again for tiling, write back. You would pretty much wipe out a good portion of your bandwidth savings and gain very little in terms of rasterization efficiency, at least from what I can figure.

Hmmm, so what about overlapping geometry? You are still going to need to upload large portions of vertex data at least twice, and this will cost you some additional bandwidth. *Rough numbers* - figure 500K polys per scene, 2 vertices per poly: your numbers may well go from 720MB/s to over 1GB/s with that level of geometry, with the additional bandwidth needs increasing above and beyond those of a traditional rasterizer. The penalty may be even larger, as you will need to calculate at least certain portions over again with the T&L unit, and it will have to be able to handle non-static vertices-

When you break the scene up into tiles you will be drawing out all of the polys needed for the scene, including those portions of polys not drawn. Unless you go with 32bit Z full time (and even then you may still run into problems), you will need to add additional calculations to ensure you don't have "tearing" at the tile edges. I'm thinking of adding additional vertices at the edges of each tile (can you think of a better way?).

Sounds good so far
 

pen^2

Banned
Apr 1, 2000
2,845
0
0
damn, didn't know you worked in the gaming industry... how cool is that starfight demo? brag about it to make me wanna download it
 

jpprod

Platinum Member
Nov 18, 1999
2,373
0
0
damn, didn't know you worked in the gaming industry...

I'm not getting paid for what I do, so I don't actually work in the gaming industry - my real work is web software development with PL/SQL for Oracle databases. Much less interesting

how cool is that starfight demo? brag about it to make me wanna download it

I only say the following to brag: it's not a demo, it's a full free game



You would double the bandwidth needs of tiling AFAICT. You would need to buffer all of the data once to calculate HSR, write back, buffer again for tiling, write back.

True for advanced HSR techniques, yes; probably there's no point in wasting bandwidth on this, as bandwidth was the problem in the first place. However, some crude tricks done with the Z-buffer, such as ATi's HyperZ*, could improve rasterizing efficiency without eating bandwidth or die space.

* Keeping a record of the farthest Z-value drawn in an area, and not filling polys whose vertices fall within this area but beyond that farthest Z-value. Or is HyperZ something more sophisticated?


Hmmm, so what about overlapping geometry? You are still going to need to upload large portions of vertex data at least twice, and this will cost you some additional bandwidth.

But isn't overlapping already taken into consideration in the 500Kpolys/scene figure? If the chip does T&L and puts all visible vertices into the geometry buffer (including the ones making up overlapping polys, like passing 2D data with a Z-value for each vertex), I don't see a problem.

Unless the DirectX8 D3D API is really fancy and abstracts some geometry clipping features for hardware to support directly, it's still going to be up to software to do most of the overlapping geometry elimination. Polygons per scene is not going to mean visible polygons per scene for quite some time, though it's all really up to how the figure is defined.


When you break the scene up into tiles you will be drawing out all of the polys needed for the scene, including those portions of polys not drawn. Unless you go with 32bit Z full time (and even then you may still run into problems), you will need to add additional calculations to ensure you don't have "tearing" at the tile edges. I'm thinking of adding additional vertices at the edges of each tile (can you think of a better way?).

I'm thinking the tiler should render all polygons of which even a tiny part is within the tile's area, and simply not do texel fetches or filling for the parts which are not visible. This could lead to inefficiency with large triangles, but at least no bandwidth would be wasted, as no work would be done for the invisible parts. Of course another, perhaps better technique would be - as you suggested - to clip polygons using a slightly larger viewport than the tile itself.
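
A sketch of the first option, with screen-space bounding rects standing in for triangle bounds so the clamp itself is visible - a real rasterizer would clamp its spans the same way. All names are made up:

#include <stdio.h>

typedef struct { int x0, y0, x1, y1; } Rect;  /* half-open bounds */

/* Clamp a primitive's screen-space bounds to the tile, so no texel
 * fetch or fill ever happens for the off-tile part. Returns 0 when
 * nothing of the primitive lands inside the tile. */
static int scissor_to_tile(Rect prim, Rect tile, Rect *out)
{
    out->x0 = prim.x0 > tile.x0 ? prim.x0 : tile.x0;
    out->y0 = prim.y0 > tile.y0 ? prim.y0 : tile.y0;
    out->x1 = prim.x1 < tile.x1 ? prim.x1 : tile.x1;
    out->y1 = prim.y1 < tile.y1 ? prim.y1 : tile.y1;
    return out->x0 < out->x1 && out->y0 < out->y1;
}

int main(void)
{
    Rect tile = { 256, 0, 512, 256 };    /* second 256x256 tile     */
    Rect tri  = { 200, 100, 300, 200 };  /* straddles the tile edge */
    Rect vis;
    if (scissor_to_tile(tri, tile, &vis))
        printf("fill only %dx%d pixels\n",   /* 44x100, not 100x100 */
               vis.x1 - vis.x0, vis.y1 - vis.y0);
    return 0;
}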

The planning phase seems to be going rather fine on the "ideas" level; perhaps I could start an IP company and give Finland's 3D star, Bitboys Oy, some rough competition
 

fodd3r

Member
Sep 15, 2000
79
0
0
me thinks going full tile rendering isn't worth it.

hyperz with framebuffer tiling would be better. you could get some hsr from the hyperz, and with the use of embedded memory and scheduling, make effective use of memory bandwidth.

powervr and traditional renderers seem to be too messed up in one way or the other; hyperz and some organization of scene work seem much more effective.

also, as the t&l unit (hardware/software) churns out its data, couldn't it be piped to a unit on the vid card that can do some hsr on the fly? with some cache + sram (its non-destructive reads are a good thing when accessing the same data over and over again) it could clear out a fair bit of the overdraw.


 

BenSkywalker

Diamond Member
Oct 9, 1999
9,140
67
91
Jukka-

"However, some crude tricks done with Z-buffer, such as ATi's hyperZ* could improve rasterizing efficency without eating bandwidth or die space."

HyperZ still eats some bandwidth; Z data is still stored.

"* Keeping an invariant of the smallest Z-value encountered in an area, and not filling polys whose vertexes are within this area, but beyond smallest Z-value. Or is HyperZ something more sophisticated?"

Not 100% sure what ATi is doing with HyperZ; they seem understandably tight-lipped about exactly what they are doing. Mainly they only have information available on the basic aspects of it to date, and I haven't found anyone who will *forget* they aren't supposed to say anything

"But in the 500kpolys/scene figure, isn't overlapping already being taken into consideration? If chip does T&L and puts all visible vertexes into the geometry buffer (including ones making overlapping polys, like passing 2D-data with Z-value for each vertex), I don't see a problem."

That relies on HSR coming before tiling though, doesn't it?

"Unless DirectX8 D3D API is really fancy and abstracts some geometry clipping features for hardware to support directly, it's still going to be up-to software to do most of the overlapping geometry elimination."

I'm not sure on this, but it wouldn't surprise me if DX8 did have some advanced support for this, based on the NV20 HSR technology (it only makes sense for the X-Box). Relying on software developers to deal with overdraw hasn't had the best results to date, so I wouldn't be surprised (though I'm not sure) if they did include some sort of API/hardware clipping support.

fodd3r-

"hyperz with framebuffer tiling would be better. you could get some hsr from the hyperz and with a the use of embedded memory and schedeulling, make effective use of memory bandwidth."

To date, HyperZ seems to be extremely poor at doing what it is supposed to in most current games. The fact that ATi has a 15%-20% edge in bandwidth over the GF2, yet performs at lower levels in bandwidth-limited situations, has me extremely unimpressed so far.

This very well could be the result of horrible ATi drivers (that would be a shocker), but I don't put much faith at all in the HyperZ technology at this point. This may have something to do with the fact that both 3dfx and nVidia have already figured out alternate methods of doing HyperZ-type functions through drivers (the Det3s and the latest V4/5 drivers with "precision" settings), though that has not been stated in an official capacity that I know of yet. If nVidia and 3dfx can both emulate the same benefits through drivers without wasting die space, then that seems like a much better solution to me.

"also, as the t&l unit (hardware/software) churns out it's data couldn't it be piped to a unit on the vid card the can do some hsr on the fly, with some cache + sram(it's non-destructive reads are a good thing when accessing the same data over and over again) it could clear out a fair bit of the overdraw?"

Expensive, very expensive. You would need several MBs (4-8) of SRAM (anyone remember the price of the 2MB Xeons?); you may well be looking at a several-thousand-dollar video card. He!! yeah, it would work great though
 

fodd3r

Member
Sep 15, 2000
79
0
0
i don't think ati's hyperz is all that great since it's not fully operational as of yet. *mutter*

if it was i'd bet the radeon would spank the geforces. no comments please, it's my personal view of the facts and what i consider to be "good" performance.

what if each polygon was considered parallel to some arbitrary plane? so each poly would be stored as a single vertex -- x, y and z coordinate. now let's assume there are 5k polies in the frustum, each represented by 64 bits of x+y+z. i think that's right, that's 40kb -- i'm worried i did something wrong with the calc. then you could merely work front to back, sorting through the polies, removing only those that are fully covered.

the thing is, being able to do this efficiently is difficult. you could bin polies based on x,y coordinates and work through tiles, or you could just work through the entire scene at once. sorting for tiles would take too long, and doing the entire thing at once probably wouldn't be too good either.
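
something like this is all i mean by the sort - each poly squashed to one representative point, ordered by z, then x, then y. all names made up (and the struct here uses plain floats for simplicity; the 40kb figure above assumes a packed 64 bits per poly):

#include <stdlib.h>
#include <stdio.h>

typedef struct { float x, y, z; } PolyPoint;  /* one point per poly */

/* order: nearest z first, then x, then y, for the front-to-back walk */
static int cmp_zxy(const void *pa, const void *pb)
{
    const PolyPoint *a = pa, *b = pb;
    if (a->z != b->z) return a->z < b->z ? -1 : 1;
    if (a->x != b->x) return a->x < b->x ? -1 : 1;
    if (a->y != b->y) return a->y < b->y ? -1 : 1;
    return 0;
}

int main(void)
{
    PolyPoint polys[] = { {5,5,0.9f}, {1,2,0.1f}, {3,3,0.5f} };
    qsort(polys, 3, sizeof polys[0], cmp_zxy);
    for (int i = 0; i < 3; i++)
        printf("z=%.1f\n", polys[i].z);  /* 0.1, 0.5, 0.9 */
    return 0;
}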

now before i continue, let's build our theoretical video card. these are subjective averages of the rumoured specs of the nv20, radeon 2 and matrox g800.

-let's assume we have 4 pipes with 3 texturing units per.
-200mhz chip
-60 million triangles, with a max of 8 lights.
-256kb cache (let's assume there are no bandwidth limitations)
-1 meg of embedded sram 256bit ddr bus
-8 megs of embedded dram (the rendition verite 4400 had 12 megs, so 8 is reasonable) 256bit ddr bus
-32 megs 200mhz ddr mem, on pcb
-component X, this is the mystery unit that we can do a (fill in the blank) with.


i ask the contributors of the thread to use the above mentioned spec so we can have some reference.

now that we've defined our mystery card, on with my post.

you could organize each poly by a z value (furthest to closest), sub-sorted by x (right to left) and then y (top to bottom), so as the info comes in it's put into embedded dram. the key thing is that you want the information physically organized inside the dram in this manner, as well as one can. one thing to note is that as the info comes in it's out of order, so you have to use a portion of your memory to link "disjointed" pieces of information -- something like the "continued on page ______" thing, as sketched below.
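
the linking could look something like this - fixed-size chunks inside the edram, chained by an index whenever a bucket overflows. all sizes and names are made up:

#include <stdio.h>

#define CHUNK_POLYS 14
#define NO_NEXT     0xFFFFu

/* a fixed-size chunk of poly ids; "next" is the continued-on-page
 * link to the chunk holding the overflow, or NO_NEXT */
typedef struct {
    unsigned short poly_id[CHUNK_POLYS];
    unsigned short count;
    unsigned short next;
} Chunk;

static Chunk    pool[4096];   /* carved out of the embedded dram */
static unsigned pool_used;

static unsigned short chunk_alloc(void)
{
    pool[pool_used].count = 0;
    pool[pool_used].next  = NO_NEXT;
    return (unsigned short)pool_used++;
}

/* append an out-of-order poly to its bucket, chaining a new chunk
 * whenever the current one fills up */
static void bucket_append(unsigned short *head, unsigned short poly)
{
    if (*head == NO_NEXT) *head = chunk_alloc();
    unsigned short c = *head;
    while (pool[c].count == CHUNK_POLYS) {
        if (pool[c].next == NO_NEXT) pool[c].next = chunk_alloc();
        c = pool[c].next;
    }
    pool[c].poly_id[pool[c].count++] = poly;
}

int main(void)
{
    unsigned short bucket = NO_NEXT;
    for (unsigned short i = 0; i < 30; i++)  /* overflows two chunks */
        bucket_append(&bucket, i);
    printf("chunks used: %u\n", pool_used);  /* 3 */
    return 0;
}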

now once the info has poured in, you can stream it in chunks -- if need be -- to "unit x", which removes the hidden polies. now as you do this you also have to delete those polies in the actual zbuffer. i don't think that should be too big a deal, as long as you do some referencing.

i'm not sure if any of this will make sense. i should say that some of the inspiration for this idea came from a post on b3d.
 

bluemax

Diamond Member
Apr 28, 2000
7,182
0
0
Slightly OT, but hey jpprod - you need a musician? Experienced with MIDI and tracker, and any final output can always be tossed out to MP3.
I'll do a search for "starfight" and see what I can find...

 

jpprod

Platinum Member
Nov 18, 1999
2,373
0
0
what if each polygon was considered parallel to some arbitrary plane? so each poly would be stored as a single vertex -- x, y and z coordinate. now let's assume there are 5k polies in the frustum, each represented by 64 bits of x+y+z. i think that's right, that's 40kb -- i'm worried i did something wrong with the calc. then you could merely work front to back, sorting through the polies, removing only those that are fully covered.

Isn't this in a sense how HyperZ works? It discards polygons whose vertices fall on an already-filled area and whose coordinates are beyond the largest encountered Z-value, AFAIK. Sorting would certainly make HyperZ much more efficient.

BTW, the Z,X,Y sorting you suggested could indeed boost geometry tiling algorithms. But considering the price of 1MB Xeon CPUs, I think the 1MB SRAM in the specs is still a tad over the top in cost.



Slightly OT, but hey jpprod - you need a musician?

Unfortunately I'll have to answer no - two very talented musicians are already tied to JP-Production, and as of now it seems that StarFight VI will be my "exit stage left" title. Thanks for asking, though
 

fodd3r

Member
Sep 15, 2000
79
0
0
2 things.

first, this sram isn't running at the 450mhz the xeon is pulling. second, the gamecube uses a much larger embedded sram pool. still over the top?

one last thing: with tile rendering i hear there is a problem with alpha blending due to "state changes", mind elaborating on that?
 

jpprod

Platinum Member
Nov 18, 1999
2,373
0
0
second, the gamecube uses a much larger embedded sram pool. still over the top?

Gamecube uses a completely new type of memory called 1T-SRAM, which is essentially SRAM in its nature but has only one transistor per memory cell, compared to the six of a normal SRAM cell. Certainly a 1MB buffer could be done with 1T-SRAM; however, I'm not so sure about normal 200MHz-range SRAM. Aside from price, its power consumption (and thus heat dissipation) is very high.

one last thing: with tile rendering i hear there is a problem with alpha blending due to "state changes", mind elaborating on that?

I can't say for sure what "state changes" refers to when talking about alpha blending in conjunction with tile renderers. It could be that, due to the out-of-order nature of rendering in a tile architecture, handling transparent polygons is much trickier than in traditional renderers, in which they're simply rendered last. However, a bigger concern for alpha with current implementations of tile renderers is their very low peak fillrate - one cannot eliminate overdraw and multiple frame buffer writes with transparent surfaces.
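
For reference, here's the traditional ordering I mean, as a sketch: opaque polys go through the Z-buffer in any order, transparent ones are held back, sorted back-to-front and blended last. Every blend is a read-modify-write of the frame buffer, which is exactly the traffic a deferred tiler can't eliminate. The code is only an illustration:

#include <stdlib.h>
#include <stdio.h>

typedef struct { float z; int opaque; } Poly;

static int cmp_back_to_front(const void *pa, const void *pb)
{
    const Poly *a = pa, *b = pb;
    return (a->z < b->z) - (a->z > b->z);  /* farthest first */
}

static void draw(const Poly *p)  /* stand-in for the rasterizer */
{
    printf("%s poly at z=%.1f\n", p->opaque ? "opaque" : "blend ", p->z);
}

int main(void)
{
    Poly scene[] = { {0.4f,0}, {0.7f,1}, {0.2f,1}, {0.9f,0} };
    size_t n = sizeof scene / sizeof scene[0], i;
    Poly trans[4]; size_t nt = 0;

    for (i = 0; i < n; i++)            /* pass 1: opaque, any order   */
        if (scene[i].opaque) draw(&scene[i]);
        else trans[nt++] = scene[i];

    qsort(trans, nt, sizeof trans[0], cmp_back_to_front);
    for (i = 0; i < nt; i++)           /* pass 2: blend back-to-front */
        draw(&trans[i]);
    return 0;
}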
 

fodd3r

Member
Sep 15, 2000
79
0
0
the gamecube's gpu runs at 202mhz, i believe. the embedded sram is synced, to my knowledge, so the 200mhz 1t-sram is feasible. as for how many watts this thing will need, i'm not sure, but a hybrid design could be done where the embedded memory uses a slightly lower micron process.

thoughts and or suggestions?
 