I'm traveling right now, so I don't have the raw data with me (I ran the experiments myself). However, I can tell you how I arrived at the conclusion, and you can verify it yourself if you wish.
My configuration:
AMD Phenom II X6 1090T
8GB RAM
HD7850 2GB, stock clocks of 860/1200 MHz, I believe.
Windows 7 SP1
Latest AMD drivers, 14.4 (the one that was just released, not the beta)
3600x1920 Eyefinity
I ran Star Swarm with max detail at my resolution with raw data output enabled. After the test, I plotted batch count against the time in milliseconds taken per batch, then compared the Mantle chart against the D3D chart.
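If you want to reproduce the charts, this is roughly the idea, sketched in Python. I'm assuming the raw output has been exported to two CSV files with batch-count and per-batch-millisecond columns; the file and column names below are placeholders, since Star Swarm's actual raw output format may differ.

    import matplotlib.pyplot as plt
    import pandas as pd

    # Placeholder file/column names -- adapt to the actual raw output format.
    mantle = pd.read_csv("starswarm_mantle.csv")  # columns: batch_count, batch_ms
    d3d = pd.read_csv("starswarm_d3d.csv")

    fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(10, 4))
    ax1.scatter(mantle["batch_count"], mantle["batch_ms"], s=4, alpha=0.4)
    ax1.set(title="Mantle", xlabel="batch count", ylabel="ms per batch")
    ax2.scatter(d3d["batch_count"], d3d["batch_ms"], s=4, alpha=0.4)
    ax2.set(title="D3D", xlabel="batch count")
    plt.tight_layout()
    plt.show()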
I found that Mantle scales very well with high batch counts while D3D does not: Mantle's response times are significantly faster than D3D's towards the higher batch counts, while towards the lower batch counts the two are similar. The behavior of each API also reveals its level of optimization. What's interesting is how tightly banded the D3D timings are with respect to batch count, whereas Mantle's response times are spread fairly uniformly across the same range, making them look almost independent of batch count.
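To put a number on the scaling rather than eyeballing the charts, you can fit a straight line to each API's points; the slope is the marginal cost of one extra batch. Same assumed file and column names as above:

    import numpy as np
    import pandas as pd

    for name, path in [("Mantle", "starswarm_mantle.csv"), ("D3D", "starswarm_d3d.csv")]:
        df = pd.read_csv(path)
        # Least-squares fit: batch_ms ~= slope * batch_count + intercept.
        slope, intercept = np.polyfit(df["batch_count"], df["batch_ms"], 1)
        print(f"{name}: {slope * 1000:.3f} us per additional batch "
              f"(intercept {intercept:.2f} ms)")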
I'm not sure if you have done a deep dive on application performance tuning before, but tight bands indicate optimization. It is very rare for something untuned and unoptimized to have such tight banding out of the box.
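Banding tightness is easy to quantify too: bin the samples by batch count and look at the spread of batch times within each bin. A tightly banded API will show a small per-bin standard deviation relative to the mean. A sketch, under the same assumptions as the snippets above:

    import pandas as pd

    def band_tightness(path, bins=20):
        df = pd.read_csv(path)  # assumed columns: batch_count, batch_ms
        df["bin"] = pd.cut(df["batch_count"], bins)
        # Per-bin spread of batch times; tight banding -> low cv.
        stats = df.groupby("bin", observed=True)["batch_ms"].agg(["mean", "std"])
        stats["cv"] = stats["std"] / stats["mean"]  # coefficient of variation
        return stats

    print(band_tightness("starswarm_d3d.csv"))
    print(band_tightness("starswarm_mantle.csv"))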
I plan on tuning the Mantle runs a bit more, if possible. I did notice that Mantle spread the load evenly across my six cores during the run while D3D only pushed two cores, which could in principle explain the wider spread in Mantle's timings. While it is unlikely that context switching would cause jitter at the millisecond level, this is something that needs to be verified.
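I only watched the core loads in Task Manager, but they can be logged properly while the benchmark runs, for example with psutil (a real Python library; the one-second sampling interval is just my choice):

    import psutil

    # Sample per-core utilization once a second while the benchmark runs;
    # stop with Ctrl+C to print the per-core averages.
    samples = []
    try:
        while True:
            samples.append(psutil.cpu_percent(interval=1.0, percpu=True))
    except KeyboardInterrupt:
        pass

    for core, loads in enumerate(zip(*samples)):
        print(f"core {core}: average load {sum(loads) / len(loads):.1f}%")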