AMD's next GPU uarch is called "Polaris"


ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
It's not like they just trash their old designs and start over from scratch. Once they tweak GCN enough, it makes sense to give it a new name, just for the marketing boost.

Personal wishlist: more registers per work unit, dismantling the existing geometry/tessellation hardware and doing it all distributed on the CUs (with necessary tweaks to make that fast).

More registers per thread is certainly a plus when it comes to increasing occupancy but lowering their wavefront size from 64 to 16 to match the vector unit width would do wonders for them since it reduces the register file by a factor of 4x! Around a billion transistors are used to create JUST the register file alone on Fiji. Naively increasing the size of the register file would increase latencies which lowers the clocks potentially leading to lower performance ...

I wouldn't cut out those fixed function units yet while they're more efficient from a power or silicon area perspective, and that especially goes for tessellation units, which are extremely hard to parallelize. The next stage I would tackle in getting more flexible is blending, by attempting to make primitive ordering cheap (much like mobile GPUs, without the sorting nonsense) and then giving direct access to depth/stencil buffers from shaders ...
 

buletaja

Member
Jul 1, 2013
80
0
66
You should see Mike Mantor's paper.

It is no coincidence that he was recently recognized by AMD:
http://www.hpcwire.com/off-the-wire...architect-michael-mantor-to-corporate-fellow/
http://ir.amd.com/phoenix.zhtml?c=74093&p=RssLanding&cat=news&id=2118863

He is the father of the X360 GPU SC design, plus the Xbox One.
No wonder he put that in as a reminder.




and this is the paper



Basically,
you have 2 kinds of vector unit
rather than the current GCN scalar + SIMD.

Now
the scalar unit is detached and has its own PC,
and is made from a 1-2 wide vector unit.
The vector unit is still the same as before, and will probably be less wide.

With this modification,
once DX12 starts to land and is used properly,
this will surely happen.
Read between the lines.



Remember, GPUs are normally not intended for irregularity,
so it is better to modify the arch to suit both kinds of workload:
gpgpu = which tend to be irregular --> flexible scalar
gfx = which tend to be regular --> prev SIMD

Intel, AMD, and Nvidia have been researching this with MS since 2010;
it is us that sometimes forget
they are doing this and publishing papers etc.

This is an Intel slide (2010-2011).
Read the text at the bottom!!!
=====================


It is also no coincidence
that the X1 is multicontext.
The X1 is also the first to use a GPUMMU (per process; the IOMMU is exposed per device, not per process).
(The X1 has 8 Gfx contexts; every modern GPU, from the host POV, always has 1 Gfx context.)
It also has Onion3 BW,
and has the same video block BW as the upcoming Polaris video block = 4 GB/sec ~ 32 Gbps.
 

buletaja

Member
Jul 1, 2013
80
0
66
Then you should also see that
they have been trying this kind of new paradigm since the X360,
but the PC has not caught up with the paradigm.

A good article from back then:
http://arstechnica.com/features/2005/05/xbox360-1/


So MS needs to push the entire ecosystem into a streaming model,
which, funnily enough, is actually why the X360 was designed like that.

All of this is also because DX12 explicitly moves the GPU arch into a streaming model.
If you model the GPU like that, you need a beefier frontend; that is also why
DX12 needs a powerful CPU and needs to utilize all CPU cores if available.

But that is just a stopgap solution;
the better one is improving the CP (the CP probably becomes a CPU-like core).



That's why you see an "ACE core" instead of just an ACE, or the X1 supporting 16 streams.
The funny thing is Fury has 4 ACEs; they are not ACEs but ACE cores!!
Each ACE core supports up to 16 streams.


DX11 non streaming model


DX12 is a streaming model, which means it needs a better CP or a powerful CPU in the frontend
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,422
1,759
136
More registers per thread is certainly a plus when it comes to increasing occupancy but lowering their wavefront size from 64 to 16 to match the vector unit width would do wonders for them since it reduces the register file by a factor of 4x! Around a billion transistors are used to create JUST the register file alone on Fiji. Naively increasing the size of the register file would increase latencies which lowers the clocks potentially leading to lower performance ...
You can't just cut the work unit size like that, because AMD uses a trick where the register file is actually split into 4 separate register files for the 4 segments of the 64-unit wavefront, each of which has only a single r/w port from which the operands are read over 4 clocks. Four single-ported register files are much cheaper than a single 4-ported register file, even if you make it a lot smaller.
 

JDG1980

Golden Member
Jul 18, 2013
1,663
570
136
That's weird, and the first that I've heard of Arctic Islands being non-GCN. Unless Polaris is just GCN 2.0.

It's important to remember that the designations of GCN 1.0, 1.1, and 1.2 were made up by review sites, and don't officially come from AMD. The official AMD slides refer to the Hawaii/Kaveri/console architecture ("GCN 1.1") as "2nd generation GCN" and the Tonga/Carrizo/Fiji architecture ("GCN 1.2") as "3rd generation GCN".

I very much doubt that Polaris will be a clean-sheet redesign the way that Zen will be on the CPU front. As Techhog alluded to, the last time Nvidia did a full clean-sheet redesign was Fermi. Kepler was a modified refinement of Fermi, and Maxwell was a modified refinement of Kepler. And the chances are good that Pascal will be an updated, die-shrunk version of Maxwell. This is despite the fact that Nvidia has considerably more resources than AMD and fewer places to spend them on.

Unlike AMD's construction core CPUs, the GCN architecture isn't so fundamentally flawed that it has to be completely thrown out. Updates, optimizations, and refinements should be adequate to make a competitive product.
 

DrMrLordX

Lifer
Apr 27, 2000
21,808
11,164
136
That makes sense, and overall it would be better for AMD to stick with GCN for a while yet. They still haven't unlocked the full potential of that uarch in my opinion, never mind the unfinished business of HSA with GCN iGPUs . . .
 

3DVagabond

Lifer
Aug 10, 2009
11,951
204
106
Considering that DX12, Vulkan, and Metal are just going to leverage GCN's capabilities, not surpass them, I don't think they need a redesign yet.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
You can't just cut the work unit size like that, because AMD uses a trick where the register file is actually split into 4 separate register files for the 4 segments of the 64-unit wavefront, each of which has only a single r/w port from which the operands are read over 4 clocks. Four single-ported register files are much cheaper than a single 4-ported register file, even if you make it a lot smaller.

Umm, no ...

Each of those separate register files is dedicated to its own vector unit ...

How AMD got to the size of 64 as their wavefront is due to the fact that it takes 4 cycles to execute an instruction for an entire wavefront on their vector units ...

AMD GCN has at least 3 read ports and 1 write port per 32 bit ALU considering the fact that they support native trinary operations going by their OpenGL extension or their compiler heuristics so that comes to a total of 64 R/W ports per register file which means that a multi-port I/O register file design isn't as expensive as you lead me to believe ...

The reason why GCN has a wavefront size of 64 has to do with their instruction issue rates and the execution pipeline latency. The scheduler can only issue ONE VALU op out of the 4 vector units so it takes at least 4 cycles to issue a vector instruction to each SIMD and it also takes exactly 4 cycles to execute the said instruction too hence (SIMD16*4 cycles) = 64 threads per wavefront ...
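
The arithmetic in this post can be written out as a small sketch. It only restates the numbers given above (SIMD16 units, 4 vector units per CU, one VALU issue per cycle, 3 reads + 1 write per ALU); the variable and function names are made up for illustration, and the round-robin issue order is an assumption about how "one out of 4 vector units" plays out.

```python
# Numbers taken from the post above; names are illustrative only.
SIMD_WIDTH = 16          # lanes per vector unit (SIMD16)
SIMDS_PER_CU = 4         # vector units sharing one scheduler
READ_PORTS_PER_ALU = 3   # ternary ops need three source operands
WRITE_PORTS_PER_ALU = 1

# The scheduler issues one VALU op per cycle, so a given SIMD only gets a
# new instruction every SIMDS_PER_CU cycles and must stay busy that long.
cycles_per_valu_op = SIMDS_PER_CU
wavefront_size = SIMD_WIDTH * cycles_per_valu_op

# Port count claimed in the post: (3 R + 1 W) for each of the 16 ALUs.
ports_per_simd = SIMD_WIDTH * (READ_PORTS_PER_ALU + WRITE_PORTS_PER_ALU)


def issue_order(num_cycles):
    """Assumed round-robin: index of the SIMD that receives an op each cycle."""
    return [cycle % SIMDS_PER_CU for cycle in range(num_cycles)]


print("threads per wavefront:", wavefront_size)
print("R/W ports per SIMD register file:", ports_per_simd)
print("issue order over 8 cycles:", issue_order(8))
```

Under these assumptions, shrinking the wavefront to 16 threads would mean either issuing to every SIMD each cycle or finishing a vector op in one cycle, which is why the post ties it to bigger schedulers and fatter ALUs.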

What AMD should do to improve their occupancy is make their ALUs fatter and their schedulers bigger, to get to a comfortable granularity of 16 threads per wavefront, even if it means sacrificing several CUs ...

While these changes may inadvertently improve IPC, I don't encourage AMD to improve upon that aspect, as it runs counter to the idea of exploiting parallelism ...
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,422
1,759
136
Umm, no ...

Each of those separate register files is dedicated to its own vector unit ...

No. Although, as GCN has the number 4 several times at different levels, it's easy to get confused. To put it more clearly:

Each SIMD pipe inside a GCN CU has 4 separate 16-wide register files, each with a single r/w port. So a CU has 16 register files.

How AMD got to the size of 64 as their wavefront is due to the fact that it takes 4 cycles to execute an instruction for an entire wavefront on their vector units ...

AMD GCN has at least 3 read ports and 1 write port per 32 bit ALU considering the fact that they support native trinary operations going by their OpenGL extension or their compiler heuristics so that comes to a total of 64 R/W ports per register file which means that a multi-port I/O register file design isn't as expensive as you lead me to believe ...
Yes, but this also allows them to make the register files cheaper: because they do barrel processing over 64 elements in 4 cycles, and they need 3 operands and 1 write, they can split their register file into 4 quarters and touch each quarter exactly four times over four clocks, allowing them to make it a very simple reg file with 1 r/w port.

to illustrate, each wavefront is split into a,b,c,d and each operation requires r1, r2, r3, w:

Code:
            wavefront 1     wavefront 2
reg file a: r1  r2  r3   w  r1  r2  r3   w
reg file b:     r1  r2  r3   w  r1  r2  r3   w
reg file c:         r1  r2  r3   w  r1  r2  r3   w
reg file d:             r1  r2  r3   w  r1  r2  r3   w
(writes are probably delayed over a wavefront so that ex gets a full 4 cycles, and results are forwarded over one cycle when required.)

The ultimate result is that all execution units get fed from energetically very cheap register files, but that you can only read "in sequence", so you have to use 64-element wavefronts instead of operating on new 16-element vector each cycle.
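
For anyone who wants to poke at that schedule, here is a toy model of it. It is only a sketch of the scheme as described in this post, not a statement about how the hardware is actually built: it assumes each quarter of a wavefront starts one cycle after the previous one and that the write lands in the same 4-cycle slot as in the diagram (the post itself says writes are probably delayed). All names are made up.

```python
OPS = ["r1", "r2", "r3", "w"]     # three operand reads and one write per op
QUARTERS = ["a", "b", "c", "d"]   # four 16-lane segments of a 64-wide wavefront


def schedule(num_wavefronts):
    """Return {reg_file: {cycle: access}} for back-to-back wavefront ops.

    Quarter number q of wavefront w starts at cycle 4*w + q, so each
    register file runs one cycle behind the previous one.
    """
    table = {q: {} for q in QUARTERS}
    for w in range(num_wavefronts):
        for qi, q in enumerate(QUARTERS):
            for oi, op in enumerate(OPS):
                cycle = 4 * w + qi + oi
                # A single r/w port allows one access per file per cycle.
                assert cycle not in table[q], "port conflict"
                table[q][cycle] = op
    return table


def render(table):
    """Draw the table in the same layout as the diagram in the post."""
    last_cycle = max(max(row) for row in table.values())
    lines = []
    for q in QUARTERS:
        cells = [table[q].get(c, "").ljust(3) for c in range(last_cycle + 1)]
        lines.append(f"reg file {q}: " + " ".join(cells).rstrip())
    return "\n".join(lines)


table = schedule(2)
print(render(table))
```

The internal assert never fires, which is the point of the argument: with the accesses staggered like this, every register file sees exactly one access per cycle, so a single port is enough as long as work arrives in 64-element wavefronts.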
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
No. Although, as GCN has the number 4 several times at different levels, it's easy to get confused. To put it more clearly:

Each SIMD pipe inside a GCN CU has 4 separate 16-wide register files, each with a single r/w port. So a CU has 16 register files.

https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

I suggest that you read the above whitepaper ...

The total register file size is 256 KB, but it is partitioned into exactly 4 slices (64 KB each), one for every vector unit. The paper makes no mention of further sub-slices like you claim ...

Furthermore the VGPRs are 64 lanes wide ...
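
Those figures can be checked against each other with a few lines. This is just the arithmetic from the numbers above, assuming 32-bit lanes; the cap of 10 wavefronts per SIMD is an assumption brought in from AMD's public GCN material rather than from this post, and the function name is made up.

```python
KIB = 1024

CU_VGPR_BYTES = 256 * KIB   # total vector register file per CU (from the post)
SIMDS_PER_CU = 4            # one slice per vector unit (from the post)
LANES_PER_VGPR = 64         # VGPRs are 64 lanes wide (from the post)
BYTES_PER_LANE = 4          # assumption: 32-bit registers

bytes_per_simd = CU_VGPR_BYTES // SIMDS_PER_CU
bytes_per_vgpr = LANES_PER_VGPR * BYTES_PER_LANE
vgprs_per_simd = bytes_per_simd // bytes_per_vgpr


def max_wavefronts(vgprs_per_wavefront, hw_cap=10):
    """Wavefronts one SIMD can keep resident for a given register budget.

    hw_cap is an assumed hardware limit on resident wavefronts per SIMD.
    """
    return min(hw_cap, vgprs_per_simd // vgprs_per_wavefront)


print("bytes per SIMD slice:", bytes_per_simd)
print("VGPRs per SIMD:", vgprs_per_simd)
for budget in (24, 64, 128):
    print(f"{budget} VGPRs per wavefront ->", max_wavefronts(budget), "wavefronts")
```

This is the occupancy trade-off from earlier in the thread: the more registers a shader uses, the fewer wavefronts fit in a 64 KB slice.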

Yes, but this also allows them to make the register files cheaper: because they do barrel processing over 64 elements in 4 cycles, and they need 3 operands and 1 write, they can split their register file into 4 quarters and touch each quarter exactly four times over four clocks, allowing them to make it a very simple reg file with 1 r/w port.

to illustrate, each wavefront is split into a,b,c,d and each operation requires r1, r2, r3, w:

Code:
            wavefront 1     wavefront 2
reg file a: r1  r2  r3   w  r1  r2  r3   w
reg file b:     r1  r2  r3   w  r1  r2  r3   w
reg file c:         r1  r2  r3   w  r1  r2  r3   w
reg file d:             r1  r2  r3   w  r1  r2  r3   w
(writes are probably delayed over a wavefront so that ex gets a full 4 cycles, and results are forwarded over one cycle when required.)

The ultimate result is that all execution units get fed from energetically very cheap register files, but that you can only read "in sequence", so you have to use 64-element wavefronts instead of operating on new 16-element vector each cycle.

Not true ...

Whether you have a monolithic multi-ported register file or several divided register files with fewer I/O ports makes no difference when you end up with the same number of I/O ports and the same total size of 64 KB ...

Wavefront size makes no difference in this case ...

The execution units may have segmented access per each cycle but the register file is definitely not split into sub-slices with all the evidence we have ...
 

Timmah!

Golden Member
Jul 24, 2010
1,463
729
136
Nice one thanks, but i was talking about the hardware.
With 4K monitors becoming mainstream in the next 2-3 years, the hardware performance of GPUs will have to be 5-10 times (perhaps more) what would be needed for ray tracing at 1080p.

I know. I just don't think there is much difference whether the GPU makers are going for VR and high-res or path tracing, because ultimately both simply need faster hardware.
 

RussianSensation

Elite Member
Sep 5, 2003
19,458
765
126
Surprised they denote no new changes to the Rasterizer or the ROPs as this is where Fury X is struggling against Maxwell.




Looking forward to reading about other changes. Can't wait for 14/16nm GPUs to finally bring massive improvements in performance.

They need to address the inefficiencies of Fiji since it should have been ~ 50% faster than 290X at high resolution.

Instead, Fury X right now sits at about 36% faster than 290X at 4K.
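
A rough way to see where a "~50%" expectation could come from is to compare peak shader throughput. The sketch below is an assumption-laden estimate, not something from the slide: the shader counts and reference clocks are taken from public spec sheets (they are not stated in this post), and the 36% figure is simply the number quoted above.

```python
# Assumed reference specs from public spec sheets, not from this thread.
FURY_X = {"shaders": 4096, "clock_mhz": 1050}
R9_290X = {"shaders": 2816, "clock_mhz": 1000}

OBSERVED_SPEEDUP = 1.36   # "about 36% faster at 4K", as quoted in the post


def peak_ratio(new, old):
    """Ratio of peak shader throughput (shader count times clock)."""
    return (new["shaders"] * new["clock_mhz"]) / (old["shaders"] * old["clock_mhz"])


theoretical = peak_ratio(FURY_X, R9_290X)
# Fraction of the theoretical gain that actually showed up in games.
realized = (OBSERVED_SPEEDUP - 1.0) / (theoretical - 1.0)

print(f"theoretical speedup: {theoretical:.2f}x")
print(f"share of the theoretical gain realized: {realized:.0%}")
```

Peak shader throughput ignores ROPs, geometry, and bandwidth, so this is only a ceiling for shader-bound work.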

New geometry processor should help them with overkill GameWorks tessellation while a new command processor should more efficiently handle parallelism/workload dispatch.
 

JDG1980

Golden Member
Jul 18, 2013
1,663
570
136
Surprised they denote no new changes to the Rasterizer or the ROPs as this is where Fury X is struggling against Maxwell.

As far as I can tell, the only problem with AMD's ROPs is that on some of their designs (Tahiti, Tonga, Fiji) they didn't use enough. Tahiti and Tonga have the same 32 ROPs as Pitcairn despite having 60% more shaders, and Fiji has the same 64 ROPs as Hawaii despite having ~45% more shaders. This is why Tonga and Fiji offered somewhat disappointing performance, especially considering their large die sizes. The shaders often went unused because the small ROP count was a bottleneck. This is also why the cut-down Tahiti, Tonga, and Fiji cards were often nearly as good as the full-fat versions.
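
The imbalance is easy to see as a shaders-per-ROP ratio. This is a quick sketch; the ROP counts and percentages are the ones quoted above, while the absolute shader counts for the full chips come from public spec sheets and are not stated in this post, so treat them as assumptions.

```python
# Full-chip configurations. Shader counts are assumed from public specs.
CHIPS = {
    "Pitcairn": {"shaders": 1280, "rops": 32},
    "Tahiti":   {"shaders": 2048, "rops": 32},
    "Tonga":    {"shaders": 2048, "rops": 32},
    "Hawaii":   {"shaders": 2816, "rops": 64},
    "Fiji":     {"shaders": 4096, "rops": 64},
}


def shaders_per_rop(name):
    """How many shaders each ROP has to serve on the given chip."""
    chip = CHIPS[name]
    return chip["shaders"] / chip["rops"]


def extra_shaders_pct(name, baseline):
    """Percentage of additional shaders relative to a baseline chip."""
    return 100.0 * (CHIPS[name]["shaders"] / CHIPS[baseline]["shaders"] - 1.0)


for name in CHIPS:
    print(f"{name:9s} {shaders_per_rop(name):5.1f} shaders per ROP")

print(f"Tahiti vs Pitcairn: +{extra_shaders_pct('Tahiti', 'Pitcairn'):.0f}% shaders")
print(f"Fiji vs Hawaii:     +{extra_shaders_pct('Fiji', 'Hawaii'):.0f}% shaders")
```

With these numbers Tahiti, Tonga, and Fiji all sit at the same, noticeably higher ratio than Pitcairn and Hawaii, which is consistent with the bottleneck described above.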
 

DiogoDX

Senior member
Oct 11, 2012
746
277
136
They don't necessarily need to change the ROPs, just increase the number. Nvidia went from ~2000 SPs and 64 ROPs on GM204 to ~3000 SPs and 96 ROPs on GM200, while Fiji has the same 64 ROPs as Hawaii. Same for the rasterizers.
 

Techhog

Platinum Member
Sep 11, 2013
2,834
2
26
The way AMD is presenting it, Polaris will just be another "generation" of GCN, implying only a few small tweaks yet again. This is really bad. Pascal is going to leave it in the dust...

Also, in before the "I told you so" parade. :/
 

3DVagabond

Lifer
Aug 10, 2009
11,951
204
106
The way AMD is presenting it, Polaris will just be another "generation" of GCN, implying only a few small tweaks yet again. This is really bad. Pascal is going to leave it in the dust...

Also, in before the "I told you so" parade. :/

Did you not look at the slide? Most of the chip is all new. Don't get hung up on a name. Unfortunately for AMD they don't understand the marketing power of simply calling it something different. Again, their engineering degrees are showing instead of their marketing degrees.
 
Mar 10, 2006
11,715
2,012
126
Did you not look at the slide? Most of the chip is all new. Don't get hung up on a name. Unfortunately for AMD they don't understand the marketing power of simply calling it something different. Again, their engineering degrees are showing instead of their marketing degrees.

Yeah, I don't understand the pessimism from that slide. Looks like a solid improvement over the current arch.

Let's see if they can recapture the performance crown. If so, I'll nab a couple of Polaris cards later this summer. If not, a couple of Pascals for me.
 