It was being disputed; that's the whole point.
But the score from a benchmark like SPEC or GB5, which performs a variety of functions, is much more useful than the score from Blender, which performs a single function.
Agreed, but again, that wasn't the point.
Any benchmark, whether it runs a single function, a mix of functions, or averages a bunch of different benchmarks, is not a good way to determine how something will perform for you; you want to run the actual application mix your server will be running to determine that. If you are running Blender 100% of the time, then that's a great benchmark for you. If you are running Blender 5% of the time, Linux kernel compiles 5% of the time, and so forth for a bunch of other stuff, then a suite that works like SPEC or GB but uses your particular application mix would be your ideal benchmark (see the sketch below). But such a thing will never exist unless you write it for yourself (and good luck getting others to care).
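Just to make the idea concrete, here's a minimal sketch of what such a personalized score could look like: a usage-weighted geometric mean of per-application results, the same style of aggregation SPEC uses, just weighted by your own mix. The workload names, scores, and weights are all made up for illustration.

```python
import math

# Hypothetical per-application scores (higher = better), e.g. speedup
# relative to some baseline machine. All names and numbers are made up.
scores = {
    "blender_render": 1.40,
    "kernel_compile": 1.10,
    "postgres_oltp":  0.95,
}

# Fraction of time the server actually spends in each workload.
usage = {
    "blender_render": 0.05,
    "kernel_compile": 0.05,
    "postgres_oltp":  0.90,
}

def weighted_geomean(scores, weights):
    """Usage-weighted geometric mean of the per-workload scores."""
    total = sum(weights.values())
    return math.exp(sum(
        (weights[name] / total) * math.log(score)
        for name, score in scores.items()
    ))

print(f"personalized score: {weighted_geomean(scores, usage):.3f}")
```

A geometric mean (rather than a plain average) keeps the comparison consistent no matter which machine you pick as the baseline, which is why SPEC aggregates its ratios that way in the first place.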
That's why others (and myself) have said that they want to see independent tests across a large swath of applications. Obviously no review is going to hit every single use case for every single person, but if you have a few database tests under various conditions + a few rendering tests with different scenes and applications + some compile tests with different source trees and compilers, etc., you start to get a pretty good idea of performance in each area. Then people can look at the area that fits their use case. Now do this across a few different review sites and you get a really good understanding of performance. I mean, isn't this how we've done it for years and years with competing x86 CPUs? Why is it all of a sudden pointless to do for ARM CPUs? Obviously there are challenges, since the available production/gaming software on ARM is limited compared to x86, but there are still lots of examples you could use.
I will say that part of my problem with SPEC and GB is that even though they touch many different workloads, they do so at a very superficial level for most tests. Just look at the transcoding test in SPEC: it processes 30 seconds of a single video, converting from one format to another. No filtering, no resizing, no color correcting, nothing. While that is a nice brief look at performance, I'd much rather see a more thorough test. I mean, Geekbench goes through, I think, 20 or so different tests within a few minutes on a modern CPU. That's a very superficial look at each of the workloads. I'm not saying it's not a valid place to start, but I'm not going to hang my hat on the results either.
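For what "more thorough" could mean in practice, here's a rough sketch of a heavier transcode test that puts scaling and filtering in the pipeline instead of a bare format conversion. The input file, filter choices, and encoder settings are illustrative assumptions on my part, not taken from SPEC or any real suite.

```python
import subprocess
import time

# Hypothetical transcode benchmark: resize, denoise, and re-encode,
# which is closer to a real production pipeline than a plain format
# conversion. Input path and settings are made up for illustration.
cmd = [
    "ffmpeg", "-y", "-i", "input.mp4",
    "-vf", "scale=1920:1080,hqdn3d",       # resize + denoise filter
    "-c:v", "libx264", "-preset", "slow",  # heavier, realistic encode
    "out.mp4",
]

start = time.perf_counter()
subprocess.run(cmd, check=True, capture_output=True)
print(f"transcode took {time.perf_counter() - start:.1f}s")
```

The point isn't these particular filters; it's that adding even one or two real processing steps changes the instruction mix the test exercises, so the score tells you more than a 30-second straight conversion does.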
What we seem to be seeing is "ARM (especially Apple's) scores too high compared with Intel on Geekbench, therefore Geekbench is a bad benchmark". Lather, rinse, repeat with SPEC. If someone compiles Blender to run on an iPhone and it compares too well with Intel, then the goalposts will be moved again.
I have never once said this or even implied it, and I don't think most others have either. There's a difference between saying ARM scored too high and saying that this is a limited set of tests and we'd like to see a fuller test suite.
The argument has been used several times here that "If Geekbench and SPEC are good enough, why does Anandtech run all these other benchmarks?" They do it because those benchmarks all represent something that some people do. I may not care about a gaming benchmark, you may not care about a database benchmark, another guy may not care about an Office task benchmark, but they all provide valuable info for the people who do care about those things. That's why Anandtech runs that variety of benchmarks. It isn't about improving the "big picture" of performance; no one is trying to add up all those numbers from the various benchmarks into a single performance figure. They can't; it's not possible. But it lets people look more closely at the benchmarks that represent what they do.
I'm confused by your point: if SPEC/GB gives you the info you need to know how CPUs will perform in comparison to each other, why are we wasting time looking at all these other benchmarks? Why can't we just look at SPEC and/or GB and pick our CPUs? You seem to be arguing here that we need a variety of tests to know how CPUs will perform in different workloads that might not be reflected by SPEC/GB, which is exactly what people have been asking for.