I would double-check with the disassembly. You can use
https://godbolt.org/ for that if disassembling the actual binary proves problematic.
I mean, since the loop does not touch the volatile variable, the compiler should be able to replace the loop with a constant, unless the reinterpret_cast is confusing it.
And if the loop is not optimized away, the compiler won't gain anything from unrolling it because of the loop-carried dependency: every add has to wait for the result of the previous one, and the compiler is not allowed to reorder floating-point adds unless you pass something like -ffast-math. Since most CPUs are able to execute 2 fadds per cycle, you are leaving performance on the table. Introducing another accumulator could help with this. You could then also try to account for add latency, but since I guess the benchmark is trying to be generic, that would be going too far.
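To make the second-accumulator idea concrete, here is a rough sketch. The function name add_two_chains and its signature are made up for illustration and are not taken from the benchmark in question. The point is only that the two running sums do not depend on each other, so the CPU can overlap their adds.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel (not from the original benchmark): performs `iters`
// additions of `x`, split across two independent accumulators.
float add_two_chains(float x, std::size_t iters) {
    float acc0 = 0.0f;
    float acc1 = 0.0f;
    std::size_t i = 0;
    for (; i + 1 < iters; i += 2) {
        acc0 += x;  // dependency chain 0
        acc1 += x;  // dependency chain 1, independent of chain 0
    }
    if (i < iters) {
        acc0 += x;  // leftover add when iters is odd
    }
    return acc0 + acc1;  // combine the chains once, after the loop
}
```

Keep in mind that splitting the sum changes the order of the additions, so for general inputs the rounded result can differ slightly from the single-accumulator version.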
The simplest way to ensure the compiler does not optimize the loop away is to store the accumulators to memory, ideally outside the loop. Also load the initial value from memory, outside the loop, so the compiler can't assume anything about the data.
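A minimal sketch of that load-before, store-after pattern could look like the following. The names g_input, g_sink and run_adds are placeholders I made up, and volatile globals are just one way to get an opaque load and store. Without relaxed floating-point flags the compiler should have to keep the adds, but I would still verify that in the disassembly.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical globals (assumed for this sketch): accesses to volatile
// objects are observable side effects, so they cannot be removed or folded.
volatile float g_input = 1.0f;
volatile float g_sink = 0.0f;

void run_adds(std::size_t iters) {
    const float x = g_input;  // single load before the loop; value unknown at compile time
    float acc = 0.0f;
    for (std::size_t i = 0; i < iters; ++i) {
        acc += x;  // the work being measured, kept in a local
    }
    g_sink = acc;  // single store after the loop keeps the result live
}
```

Because the volatile accesses sit outside the loop, they add a fixed cost per call and not a cost per iteration.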