kernelc
Member
Does anyone else think that the small L1 caches on Bulldozer could be killing performance?
Per core (half a module) there is 1/2 the instruction cache and 1/4 the data cache compared to Stars (correct me if I'm wrong).
Granted, the instruction cache is shared, but even then it must feed 2 threads.
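To put numbers on those ratios, here is a quick back-of-the-envelope sketch. The capacities are assumptions taken from the commonly quoted specs (64 KB L1I + 64 KB L1D per Stars core; 64 KB shared L1I per Bulldozer module plus 16 KB L1D per core), and splitting the shared L1I evenly between the two cores is a simplification, not how the hardware actually partitions it.

```python
# Per-core L1 capacity in KB. These figures are assumed from public spec
# sheets, not taken from the post, so double-check them before relying on them.
STARS_CORE = {"l1i_kb": 64, "l1d_kb": 64}  # both private to each core

# Bulldozer: the L1I is shared by the 2 cores of a module, the L1D is per core.
BULLDOZER_MODULE = {"l1i_kb": 64, "l1d_kb": 16, "cores": 2}


def per_core_l1(module):
    """Split the shared L1I evenly across cores (a simplification)."""
    return {
        "l1i_kb": module["l1i_kb"] / module["cores"],
        "l1d_kb": module["l1d_kb"],
    }


bd_core = per_core_l1(BULLDOZER_MODULE)
l1i_ratio = bd_core["l1i_kb"] / STARS_CORE["l1i_kb"]
l1d_ratio = bd_core["l1d_kb"] / STARS_CORE["l1d_kb"]

print(f"L1I per core vs Stars: {l1i_ratio:.2f}")
print(f"L1D per core vs Stars: {l1d_ratio:.2f}")
```

With a single thread running on a module, that thread gets the whole 64 KB L1I, so the 1/2 figure only applies when both cores are busy.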
Was there any justification for decimating the L1 cache sizes other than saving die space?
Have they sufficiently compensated (if that is even possible)?
Look at these benchmarks comparing Bulldozer, with a 200 MHz clock advantage, against the X6: not only are the L1 caches smaller, but the overall latency and bandwidth of both L1 and L2 are worse.
Based on AnandTech data (http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper/10), it seems that the L1, while small, has decent hit rates.
Surely doubling it would be useful, but I think the real culprit is the write-through approach, coupled with the very small write coalescing cache (WCC, only 4 KB).
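To illustrate why a tiny WCC hurts with a write-through L1, here is a toy model, not a description of the real hardware. Assumptions: 64-byte lines, a fully associative LRU coalescing buffer, and one L2 write per evicted line. The names `l2_writes` and `sweep` are made up for this sketch.

```python
from collections import OrderedDict

LINE_BYTES = 64        # assumed cache line size
WCC_BYTES = 4 * 1024   # WCC capacity quoted in the post


def l2_writes(store_addrs, wcc_bytes=WCC_BYTES, line_bytes=LINE_BYTES):
    """Count line writes to L2 for a stream of store addresses.

    Toy model: every store goes through to a fully associative LRU
    coalescing buffer; each evicted line costs one L2 write, and the
    lines still buffered at the end are flushed.
    """
    capacity = wcc_bytes // line_bytes
    buffered = OrderedDict()  # line number -> True, oldest first
    writes = 0
    for addr in store_addrs:
        line = addr // line_bytes
        if line in buffered:
            buffered.move_to_end(line)    # store coalesces into a held line
        else:
            if len(buffered) >= capacity:
                buffered.popitem(last=False)  # evict least recently used line
                writes += 1
            buffered[line] = True
    return writes + len(buffered)  # final flush


def sweep(working_set_bytes, passes=10, stride=8):
    """Store addresses for repeated sequential passes over a buffer."""
    return [a for _ in range(passes) for a in range(0, working_set_bytes, stride)]


fits = l2_writes(sweep(4 * 1024))     # store working set fits in the WCC
spills = l2_writes(sweep(8 * 1024))   # store working set is 2x the WCC
print(fits, spills)
```

In this model, once the set of lines being stored to exceeds 4 KB, repeated stores stop coalescing and each pass pushes every line out to L2 again, whereas a write-back L1 would absorb them until the line is evicted.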
Regards.