Some parameter combinations are hard to measure, even with carefully tuned C code:
- Reduced performance for large strides and small working sets in the L1 cache is a measurement artifact, not a property of the architecture.
- Compilers occasionally generate suboptimal instruction schedules for loads and stores.
There is no hardware penalty for larger strides, but the bandwidth becomes difficult to measure because loop overhead and other constant costs in the micro-benchmark start to dominate. As a result, the performance ridge in the stride/working-set diagram falls off for no apparent architectural reason. In this zone the diagram reflects what a compiler can achieve rather than what the hardware can do in theory.
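To make the overhead argument concrete, here is a minimal sketch of a strided-load kernel in C. It is not the benchmark used for the diagram; the function names (`strided_sum`, `loads_per_pass`, `measure_gbytes_per_s`), the 64-bit element size, and the use of `clock()` as a timer are all assumptions chosen for illustration. The point is only that the number of loads per pass shrinks as the stride grows, so fixed per-pass costs take up a larger share of the measured time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: number of loads issued in one pass over the working set. */
size_t loads_per_pass(size_t n_words, size_t stride)
{
    assert(stride > 0);
    return (n_words + stride - 1) / stride;
}

/* Hypothetical kernel: read every stride-th 64-bit word, repeated reps times.
 * The volatile qualifier forces the compiler to issue every load instead of
 * hoisting the inner loop out of the repetition loop. */
uint64_t strided_sum(const volatile uint64_t *buf, size_t n_words,
                     size_t stride, size_t reps)
{
    assert(stride > 0);
    uint64_t sum = 0;
    for (size_t r = 0; r < reps; r++) {           /* constant cost per pass    */
        for (size_t i = 0; i < n_words; i += stride)
            sum += buf[i];                        /* the load being measured   */
    }
    return sum;
}

/* Hypothetical driver: GB/s of data actually loaded, or 0.0 if the run was
 * too short for clock() to resolve, or -1.0 if allocation failed. */
double measure_gbytes_per_s(size_t n_words, size_t stride, size_t reps)
{
    uint64_t *buf = malloc(n_words * sizeof *buf);
    if (buf == NULL)
        return -1.0;
    for (size_t i = 0; i < n_words; i++)          /* initialise and warm cache */
        buf[i] = 1;

    clock_t t0 = clock();
    uint64_t sum = strided_sum(buf, n_words, stride, reps);
    clock_t t1 = clock();
    free(buf);
    (void)sum;

    double secs  = (double)(t1 - t0) / CLOCKS_PER_SEC;
    double bytes = (double)loads_per_pass(n_words, stride)
                 * (double)reps * (double)sizeof(uint64_t);
    return secs > 0.0 ? bytes / secs / 1e9 : 0.0;
}
```

With a 4 KiB working set (512 words) and a stride of 64 words, `loads_per_pass(512, 64)` is only 8, so each pass does very little useful work relative to the loop setup and branch around it. Calling `measure_gbytes_per_s(512, 64, reps)` with a large `reps` and comparing it to stride 1 would be one way to observe the effect, though the numbers depend heavily on the compiler, its flags, and the timer resolution.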