Everything you say applies to all benchmarks, not only GB. Do you think it makes more sense to compare SPEC FP results of a CPU without AVX against a CPU with AVX?
SPEC results are also far from perfect, given how the test suite is composed. I have covered this problem
here.
Another problem is how well those test sets reflect real-life performance. For example, Geekbench 6 includes a Clang compilation test. That's great if you are a developer, write your code in C, and use Clang.
But if we take PHP, Python, Ruby, and other interpreters, they are heavily optimized for Intel CPUs, which show much higher performance than Apple Silicon (
PHPBench,
PyBench,
Optcarrot). The difference reaches two times in single-threaded (ST) tests.
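To see why interpreter benchmarks behave so differently from SIMD-heavy suites, here is a minimal, hypothetical micro-benchmark in the spirit of PyBench (it is not part of the actual suite): a pure-Python loop whose speed is dominated by interpreter dispatch, not by vector units, so the result tracks how well the interpreter is tuned for a given CPU.

```python
import timeit

def interpreter_workload(n=100_000):
    # A pure-Python loop: every iteration goes through the
    # interpreter's bytecode dispatch, so this measures
    # interpreter efficiency on the CPU, not SIMD throughput.
    total = 0
    for i in range(n):
        total += i * i
    return total

# Take the best of 5 runs to smooth out scheduler noise.
best = min(timeit.repeat(interpreter_workload, number=10, repeat=5))
print(f"best of 5: {best:.4f} s")
```

Running the same script on different CPUs (at the same power limit) gives a rough idea of per-core interpreter performance, which is exactly the kind of workload Geekbench's suite underrepresents.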
There's little sense in measuring how fast a CPU renders a scene (Cinebench), for example, because in real life you will be using a GPU for that.
Geekbench also includes a lot of AI tests. That's great, but why would you use a CPU for AI workloads when the dedicated NPU is nearly ten times faster while consuming less power? It's the same situation as with Cinebench.
The same can be said about the Ray Tracer test in Geekbench. Why would you do lighting calculations on a CPU when a GPU is hundreds of times faster thanks to its dedicated hardware blocks?
So yes, comparing Geekbench scores may be amusing, but it's a useless metric for comparing CPU performance across platforms. To a certain degree, you can compare results within one platform, but only if the test suite is objective, and Geekbench 6 is not an objective test.
Ultimately, the only viable metric is the actual performance of the apps you're using, measured at a fixed power limit.