New Zen microarchitecture details


DrMrLordX

Lifer
Apr 27, 2000
22,035
11,620
136
The Blender test probably uses the lean FPU in Zen to the fullest and doesn't benefit from the big units in BWE. It shows the Zen FPU in the absolute best possible light.

Blender's actually a fairly important rendering tool that is used by many people. If a "big time" app like Blender is showing Zen in the best possible light, then I fail to see the problem. That's a large base of users who will automatically see Zen/Summit Ridge as a viable (if not preferable) product for their usage pattern, which is the only incentive they need to buy it.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Blender's actually a fairly important rendering tool that is used by many people. If a "big time" app like Blender is showing Zen in the best possible light, then I fail to see the problem. That's a large base of users who will automatically see Zen/Summit Ridge as a viable (if not preferable) product for their usage pattern, which is the only incentive they need to buy it.

Blender is important, but Cycles itself is not the only renderer available for it. There are other compatible renderers which are more optimized for CPU rendering instead of prioritizing CUDA (and OpenCL). For CPU rendering, Embree is definitely the most optimized ray tracer out there. As a standard benchmark it could well replace the extremely outdated Cinebenches, as long as the Intel-specific paths are disabled in the binaries / libraries. Cinebench, given its normal release cycle, should receive an update soon, since Cinema 4D R18 was just released. All of the current versions are legacy workloads, targeting technology that is over a decade old (up to SSE2 in 11.5, up to SSSE3 in R15).
 

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
Well Intel definitely does develop different architectures for their low power, high-end and server processors. I mean something like Haswell E or Broadwell E or HEDT Xeon is completely different architecturally from standard desktop Haswell or Broadwell and Xeon respectively, let alone their Atom line of low power CPUs.

Intel has an order of magnitude more R&D to tailor processes for different purposes. Even then they lost something like 4B each year selling Atom into the mobile market. Intel uses a different layout and will use 1D vs 2D, larger transistors and whatnot to get high frequency, as I understand it. Whatever the technical reason, it comes at a cost, be it density or dev cost, but it's a choice. But as the Atom case shows, there are clearly limits even for Intel on how far to stretch the portfolio. AMD is acting within the same constraints with an order of magnitude less R&D.

I have absolutely no doubt, looking at that challenge from a top strategic perspective, what the solution is: focus and specialize. Cut away and narrow down segments. Cut functionality far more than most believe or accept. Heavily reduce complexity in product, processes, organization, marketing, support and sales. Focus.

I respect the positive attitude many developers have. I am sure it sometimes gets the impossible done, and it drives many businesses forward and is very valuable for society. But what I have heard from my devs about cost, time and quality ranges consistently from unrealistic to completely insane. Always. It works like clockwork, and there is even a theory for it. lol. But hey, it gives fantastic dynamics.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Blender is important, but Cycles itself is not the only renderer available for it. There are other compatible renderers which are more optimized for CPU rendering instead of prioritizing CUDA (and OpenCL). For CPU rendering, Embree is definitely the most optimized ray tracer out there. As a standard benchmark it could well replace the extremely outdated Cinebenches, as long as the Intel-specific paths are disabled in the binaries / libraries. Cinebench, given its normal release cycle, should receive an update soon, since Cinema 4D R18 was just released. All of the current versions are legacy workloads, targeting technology that is over a decade old (up to SSE2 in 11.5, up to SSSE3 in R15).

Cinebench gets updated right before the shipment of Zen, which has 4x128-bit pipelines, versus Intel, which has 2x256-bit pipelines and so a max of 2x128-bit uops per cycle... Mmmmh... I guess that the new version is going to be optimized for 256-bit, right?
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Cinebench gets updated right before the shipment of Zen, which has 4x128-bit pipelines, versus Intel, which has 2x256-bit pipelines and so a max of 2x128-bit uops per cycle... Mmmmh... I guess that the new version is going to be optimized for 256-bit, right?

I would imagine they will use the solution which provides the best possible performance in general? I have no idea what code paths even exist in Cinema 4D, so they could well be 128-bit.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Well Intel definitely does develop different architectures for their low power, high-end and server processors. I mean something like Haswell E or Broadwell E or HEDT Xeon is completely different architecturally from standard desktop Haswell or Broadwell and Xeon respectively, let alone their Atom line of low power CPUs.
I think, Zen+ and Zen will be in a similar relationship. But they have to bring one uarch first.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I would imagine they will use the solution which provides the best possible performance in general? I have no idea what code paths even exist in Cinema 4D, so they could well be 128-bit.
Should be easy to find out with some performance analysis tools. I'd think that for pixel-wise calculations, 4 floats in a vector fit the 3D space and transformations rather well. Even VLIW4 was based on this and RGBA. Would double precision make sense? Not for most use cases that only have to please a human eye, while it would halve the pixel throughput.

Transformations of vertex arrays (games/gfx drivers if not done on the GPU), media processing, DGEMM/FFTs and other more scientific stuff are good use cases for 256b.
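
The lane arithmetic behind this can be sketched quickly. This is only an illustration: the `lanes` helper is a made-up name, and it counts how many elements fit in one SIMD register, not what any particular CPU actually sustains per cycle.

```python
def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of a given size that fit in one SIMD register."""
    if register_bits <= 0 or element_bits <= 0:
        raise ValueError("sizes must be positive")
    if register_bits % element_bits != 0:
        raise ValueError("element size must divide the register width")
    return register_bits // element_bits


# One 128-bit register holds exactly one RGBA (or x, y, z, w) tuple of floats.
floats_per_128 = lanes(128, 32)
# Switching to double precision halves the element count per register.
doubles_per_128 = lanes(128, 64)
# A 256-bit register holds two such 4-float tuples.
floats_per_256 = lanes(256, 32)
```

So a 4-float pixel or vertex fills a 128-bit register exactly, and a 256-bit path only pays off when the code can process two of them (or wider arrays) at once.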
 

bjt2

Senior member
Sep 11, 2016
784
180
86
I would imagine they will use the solution which provides the best possible performance in general? I have no idea what code paths even exist in Cinema 4D, so they could well be 128-bit.

You said R15 uses up to SSSE3... AFAIK it's 128-bit max...
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
You said R15 uses up to SSSE3... AFAIK it's 128-bit max...

R15 isn't the upcoming version, or is it?
I would expect that they have changed the code somewhat during three major versions and three years.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
R15 isn't the upcoming version, or is it?
I would expect that they have changed the code somewhat during three major versions and three years.

R15 is the one that has been in use for some years... 11.5 does not scale well with the number of cores, whereas R15 scales almost linearly... So it immediately gained the trust of most sites...
 

DrMrLordX

Lifer
Apr 27, 2000
22,035
11,620
136
Blender is important, but Cycles itself is not the only renderer available for it. There are other compatible renderers which are more optimized for CPU rendering instead of prioritizing CUDA (and OpenCL). For CPU rendering, Embree is definitely the most optimized ray tracer out there. As a standard benchmark it could well replace the extremely outdated Cinebenches, as long as the Intel-specific paths are disabled in the binaries / libraries. Cinebench, given its normal release cycle, should receive an update soon, since Cinema 4D R18 was just released. All of the current versions are legacy workloads, targeting technology that is over a decade old (up to SSE2 in 11.5, up to SSSE3 in R15).

A good point. Have we seen hardware-oriented sites (like AnandTech etc.) show the different Blender renderers in action on platforms/processors of interest? I know they have done runs with multiple versions of Cinebench, so I have to wonder why they wouldn't do something similar with Blender, unless they focus on Embree since it is (as you say) the most CPU-optimized render engine available.
 

cdimauro

Member
Sep 14, 2016
163
14
61
I would imagine they will use the solution which provides the best possible performance in general? I have no idea what code paths even exist in Cinema 4D, so they could well be 128-bit.
Then why did you state this:
As a standard benchmark it could well replace the extremely outdated Cinebenches, as long as the Intel-specific paths are disabled in the binaries / libraries.
before?

Anyway, why should these paths (if any) be disabled?
 

cdimauro

Member
Sep 14, 2016
163
14
61
Should be easy to find out with some performance analysis tools. I'd think that for pixel-wise calculations, 4 floats in a vector fit the 3D space and transformations rather well. Even VLIW4 was based on this and RGBA. Would double precision make sense? Not for most use cases that only have to please a human eye, while it would halve the pixel throughput.
It depends. If you use more precision than the canonical 8 bits for color components, the 24-bit precision offered by single precision's mantissa might not be enough to handle the massive calculations of heavy algorithms like ray tracing.
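
A rough way to see the 24-bit limit is to emulate single precision in Python by rounding through `struct` after every operation. This is only a sketch: `to_float32` and `accumulate` are made-up helper names, and the accumulation loop is a toy stand-in for summing many small contributions, not anything taken from a real renderer.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(start: float, step: float, count: int, single: bool = True) -> float:
    """Add `step` to `start` `count` times, optionally rounding to float32 each time."""
    total = to_float32(start) if single else start
    for _ in range(count):
        total = total + step
        if single:
            total = to_float32(total)
    return total


big = 2.0 ** 24  # the last point where float32 can still represent every integer

# In single precision each +1.0 is rounded away, so the sum never moves.
stuck = accumulate(big, 1.0, 1000, single=True)

# In double precision the same loop gives the exact result.
exact = accumulate(big, 1.0, 1000, single=False)
```

Here `stuck` stays at 2**24 while `exact` ends up 1000 higher, which is the kind of silent loss that can matter once a large running total absorbs many small terms.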
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
First you talked about removing the code paths for Intel processors in Cinebench, and then you stated that you don't know if there are any.

Ahem...

Blender is important, but Cycles itself is not the only renderer available for it. There are other compatible renderers which are more optimized for CPU rendering instead of prioritizing CUDA (and OpenCL). For CPU rendering, Embree is definitely the most optimized ray tracer out there. As a standard benchmark it could well replace the extremely outdated Cinebenches, as long as the Intel-specific paths are disabled in the binaries / libraries. Cinebench, given its normal release cycle, should receive an update soon, since Cinema 4D R18 was just released. All of the current versions are legacy workloads, targeting technology that is over a decade old (up to SSE2 in 11.5, up to SSSE3 in R15).
 

cdimauro

Member
Sep 14, 2016
163
14
61
You said R15 uses up to SSSE3... AFAIK it's 128-bit max...
Yes, it's 128-bit, and what's worse is that it's the old SIMD (with destructive two-operand operations). So, not even AVX, which is much nicer (native support for non-destructive three-operand operations) and more powerful (far fewer MOVs are required).
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
OK, got it. I regret.

Anyway, you haven't answered my other question: why should such code paths (if they really exist) be removed?

So that each and every CPU will use the same code path. For Cinebench this is not an issue, since it is not compiled with the option which allows creating alternative code paths (which are optimized for Intel). The Intel compiler can generate three types of code: the MSVC-compatible "arch" option, which creates a single code path without any alternatives; the "Qax" option, which adds an additional "Intel only" code path to the same "arch" binary; and the "Qx" option, which generates code that is fully optimized for a specific Intel µarch and will not execute on AMD or VIA CPUs, even if they support all the required instructions. If the "arch" option is used, the set level becomes a hard requirement, the same way as on GCC for example. Meaning, if you compile the binaries with the "arch:AVX" option, a CPU which doesn't support AVX cannot execute the produced code. This is probably one of the reasons why the Cinebenches use such old instruction sets.

If a binary is compiled with the "Qax:CORE-AVX512" option, for example, it will contain code paths for all instruction sets up to AVX512. These code paths are selected based on the actual capabilities of the CPU, which makes it an ideal solution. GCC or MSVC cannot automatically create a dispatcher, and all of the binaries are hard-coded for certain instruction sets. So if you want to create a benchmark which uses AVX2 but can still be used on older CPUs without AVX2 support, these are the options you have: A) write a custom dispatcher for your application, B) produce multiple binaries for different instruction sets, or C) use the Intel compiler with the Qax option. The only issue in using the Qax option is that Intel CPUs will get their own code path. When the Qax option is used for a benchmark, the dispatcher which allows separate code paths for Intel must be disabled, either so that no CPU, regardless of the vendor (AMD, Intel, VIA), uses the Intel-specific code path or so that all CPUs use it. Disabling the dispatcher is extremely easy: you basically have to patch three instructions. They are always the same regardless of the binary. For my own software and tools I always do it, since I use both AMD and Intel CPUs.

As I said, Embree is extremely well optimized, but it uses Intel's own libraries (IPP and TBB at least), and without patching away the dispatchers in them it cannot be considered a fair benchmark.
Cinebench doesn't use any proprietary libraries from Intel and it is multithreaded with OpenMP instead of TBB.
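
To make the dispatcher point concrete, here is a toy model in Python of how a vendor-gated dispatcher differs from a vendor-neutral one. Everything in it is an assumption for illustration: the `CodePath` class, the path names, and the `ignore_vendor` switch are invented, and it does not reproduce the Intel compiler's real dispatcher or the actual patch. Only the CPUID vendor strings are real.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set, Tuple


@dataclass(frozen=True)
class CodePath:
    name: str
    required: FrozenSet[str]           # ISA extensions the path needs
    vendor_only: Optional[str] = None  # None means any vendor may take it


# Hypothetical paths in one binary, ordered from most to least demanding.
PATHS: Tuple[CodePath, ...] = (
    CodePath("avx2_intel_only", frozenset({"sse2", "avx", "avx2"}),
             vendor_only="GenuineIntel"),
    CodePath("sse2_baseline", frozenset({"sse2"})),
)


def select_path(features: Set[str], vendor: str, ignore_vendor: bool = False) -> str:
    """Pick the first path whose requirements the CPU meets.

    With ignore_vendor=True the vendor check is skipped, which models a
    dispatcher that decides purely on supported instructions.
    """
    for path in PATHS:
        vendor_ok = ignore_vendor or path.vendor_only in (None, vendor)
        if vendor_ok and path.required <= features:
            return path.name
    raise RuntimeError("no code path matches this CPU")


same_features = {"sse2", "avx", "avx2"}

# Vendor-gated: two CPUs with identical features run different code.
intel_gated = select_path(same_features, "GenuineIntel")
amd_gated = select_path(same_features, "AuthenticAMD")

# Vendor-neutral: identical features always give the identical path.
amd_neutral = select_path(same_features, "AuthenticAMD", ignore_vendor=True)
```

In the gated case the AMD CPU falls back to `sse2_baseline` even though it supports AVX2, so the benchmark ends up measuring the dispatcher's choice as much as the hardware. With the vendor check removed, both CPUs take the same path.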
 

cdimauro

Member
Sep 14, 2016
163
14
61
CPUs are different, and that's why it makes sense to have differently optimized binaries, if possible: to squeeze the most out of them.

Which, unfortunately, requires going down to the uarch level, and that's where the problems come from, even with Intel's CPUs (only some of them benefit from specific optimizations: all others get the same code path as AMD and VIA CPUs).

If you buy a CPU, I think that you might be interested in getting the best performance.

Intel provides its own compilers (C++, Fortran) for that reason (but, again: ONLY for SOME uarchitectures). AMD, VIA, or other CPU vendors can do/offer the same for their uarchitectures: nobody stops them.

BTW, Intel compilers aren't free. So only people who want to invest in getting better performance buy them.

And since we are talking about software for professionals (Cinema4D isn't used by the average Joe), it makes perfect sense.

So, to recap, I don't see reasons why specific code paths (for whatever CPU: the principle is the same) should be disabled.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
A benchmark used to evaluate the performance of different hardware can have exactly zero differences in the code executed on similar hardware (i.e. the same supported instructions), regardless of the hardware manufacturer. Otherwise it doesn't qualify as a good benchmark. Real-world applications are a different story: in them, obviously, the code is optimized for the target platform as much as possible.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Users run real-world applications...

Of course they do.
But the purpose of benchmarking is to find out the differences in the performance of the hardware, not how well or poorly certain code is optimized for a certain processor.
 

cdimauro

Member
Sep 14, 2016
163
14
61
For what reason? Are people interested in knowing abstract numbers about different hardware, or in how the applications they use daily perform?
 