Over the last few years, there has been an undercurrent of uncertainty in the world of game benchmarking about exactly what it is that benchmarks are supposed to be measuring. Gone are the days when all you needed was a single FPS number and everyone knew what it meant. I have heard talk of minimums, frame times, variance, 0.1% lows, medians, and many other things. Then today, on seeing the Level 1 Techs Ryzen vs. Intel test, I realized that it was the first double-blind experiment I had ever seen applied to gaming.
I know the effects that double-blind experiments can have, turning conventional wisdom on its head by showing that beliefs people had held as obvious were in fact entirely imaginary. So what are current game benchmarks really telling us? If one system has a better average FPS but worse 1% frame times than another, does that make it worse? Can anyone even tell the difference? Is the perception of performance constant, or is it affected by things like monitor size or the player's age? I want to see some scientific method applied to this (and to everything else in the world too, while we're at it).
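To make the average-FPS-versus-1%-lows question concrete, here is a minimal sketch (my own illustration, not taken from any particular benchmarking tool) of how both figures can be derived from a log of per-frame render times; note that tools differ in exactly how they define the "1% low" metric, and the sample data below is invented.

```python
# Sketch of how average FPS and a "1% low" figure are typically derived
# from a capture of per-frame render times (milliseconds per frame).

def summarize(frame_times_ms):
    """Return (average FPS, 1% low FPS) for a run."""
    n = len(frame_times_ms)
    total_s = sum(frame_times_ms) / 1000.0
    avg_fps = n / total_s  # frames rendered divided by total run time

    # "1% low": take the slowest 1% of frames and report the FPS implied
    # by their average frame time (one common convention; definitions vary).
    worst = sorted(frame_times_ms, reverse=True)[:max(1, n // 100)]
    low_1pct_fps = 1000.0 / (sum(worst) / len(worst))
    return avg_fps, low_1pct_fps

# Hypothetical capture: mostly ~7 ms frames with occasional 40 ms stutters.
frame_times = [7.0] * 990 + [40.0] * 10
print(summarize(frame_times))  # high average FPS, but a much lower 1% figure
```

The point of the sketch is that a run can post an impressive average while the slowest 1% of frames tell a very different story, which is exactly why I am asking whether anyone can actually perceive that difference.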