Interesting bits for me are:
"Finally, there's better support of variables such as half-floats. To date, with the AMD architectures, a half-float would take the same internal space as a full 32-bit float. There hasn't been much advantage to using them. With Polaris though, it's possible to place two half-floats side by side in a register, which means if you're willing to mark which variables in a shader program are fine with 16-bits of storage, you can use twice as many. Annotate your shader program, say which variables are 16-bit, then you'll use fewer vector registers."
The enhancements in PS4 Pro are also geared to extracting more utilisation from the base AMD compute units.
"Multiple wavefronts running on a CU are a great thing because as one wavefront is going out to load texture or other memory, the other wavefronts can happily do computation. It means your utilisation of vector ALU goes up," Cerny shares.
"Anything you can do to put more wavefronts on a CU is good, to get more running on a CU. There are a limited number of vector registers so if you use fewer vector registers, you can have more wavefronts and then your performance increases, so that's what native 16-bit support targets. It allows more wavefronts to run at the same time."
And:
"We can have custom features and they can eventually end up on the [AMD] roadmap," Cerny says proudly. "So the ACEs... I was very passionate about asynchronous compute, so we did a lot of work there for the original PlayStation 4 and that ended up getting incorporated into subsequent AMD GPUs, which is nice because the PC development community gets very familiar with those techniques. It can help us when the parts of GPUs that we are passionate about are used in the PC space."
In actual fact, two new AMD roadmap features debut in the Pro, ahead of their release in upcoming Radeon PC products - presumably the Vega GPUs due either late this year or early next year.
"One of the features appearing for the first time is the handling of 16-bit variables - it's possible to perform two 16-bit operations at a time instead of one 32-bit operation," he says, confirming what we learned during our visit to VooFoo Studios to check out Mantis Burn Racing. "In other words, at full floats, we have 4.2 teraflops. With half-floats, it's now double that, which is to say, 8.4 teraflops in 16-bit computation. This has the potential to radically increase performance."
A work distributor is also added to the GPU, designed to improve efficiency through more intelligent distribution of work.
"Once a GPU gets to a certain size, it's important for the GPU to have a centralised brain that intelligently distributes and load-balances the geometry rendered. So it's something that's very focused on, say, geometry shading and tessellation, though there is some basic vertex work as well that it will distribute," Mark Cerny shares, before explaining how it improves on AMD's existing architecture.
"The work distributor in PS4 Pro is very advanced. Not only does it have the fairly dramatic tessellation improvements from Polaris, it also has some post-Polaris functionality that accelerates rendering in scenes with many small objects... So the improvement is that a single patch is intelligently distributed between a number of compute units, and that's trickier than it sounds because the process of sub-dividing and rendering a patch is quite complex."
And:
Beyond that, we're moving into the juicy stuff - the custom hardware that Sony has introduced, elements of the 'secret sauce' that allow the Pro graphics core to punch so far above its weight. In creating 4K framebuffers, a lot of the technological underpinnings are actually based on advanced anti-aliasing work with the creation of new buffers that can be exploited in a number of ways.
Right now, post-process anti-aliasing techniques like FXAA or SMAA have their limits. Edge detection accuracy varies dramatically. Searches based on high contrast differentials, depth or normal maps - or a combination - all have limitations. Sony has fashioned its own, highly innovative solution.
"We'd really like to know where the object and triangle boundaries are when performing spatial anti-aliasing, but contrast, Z [depth] and normal are all imperfect solutions," Cerny says. "We'd also like to track the information from frame to frame because we're performing temporal anti-aliasing. It would be great to know the relationship between the previous frame and the current frame better. Our solution to this long-standing problem in computer graphics is the ID buffer. It's like a super-stencil. It's a separate buffer written by custom hardware that contains the object ID."
It's all hardware based, written at the same time as the Z buffer, with no pixel shader invocation required and it operates at the same resolution as the Z buffer. For the first time, objects and their coordinates in world-space can be tracked, even individual triangles can be identified. Modern GPUs don't have this access to the triangle count without a huge impact on performance.
"As a result of the ID buffer, you can now know where the edges of objects and triangles are and track them from frame to frame, because you can use the same ID from frame to frame," Cerny explains. "So it's a new tool to the developer toolbox that's pretty transformative in terms of the techniques it enables. And I'm going to explain two different techniques that use the buffer - one simpler that's geometry rendering and one more complex, the checkerboard."