TY - GEN
T1 - Soft-OLP
T2 - 2009 18th International Conference on Parallel Architectures and Compilation Techniques, PACT 2009
AU - Lu, Qingda
AU - Lin, Jiang
AU - Ding, Xiaoning
AU - Zhang, Zhao
AU - Zhang, Xiaodong
AU - Sadayappan, P.
PY - 2009
Y1 - 2009
N2 - Performance degradation of memory-intensive programs caused by the LRU policy's inability to handle weak-locality data accesses in the last-level cache is increasingly serious for two reasons. First, the last-level cache remains in the CPU's critical path, where only simple management mechanisms, such as LRU, can be used, precluding some sophisticated hardware mechanisms to address the problem. Second, the commonly used shared cache structure of multi-core processors has made this critical path even more performance-sensitive due to intensive inter-thread contention for shared cache resources. Researchers have recently made efforts to address the problem with the LRU policy by partitioning the cache using hardware or OS facilities guided by run-time locality information. Such approaches often rely on special hardware support or lack enough accuracy. In contrast, for a large class of programs, the locality information can be accurately predicted if access patterns are recognized through small training runs at the data object level. To achieve this goal, we present a system-software framework referred to as Soft-OLP (Software-based Object-Level cache Partitioning). We first collect per-object reuse distance histograms and inter-object interference histograms via memory-trace sampling. With several low-cost training runs, we are able to determine the locality patterns of data objects. For the actual runs, we categorize data objects into different locality types and partition the cache space among data objects with a heuristic algorithm, in order to reduce cache misses through segregation of contending objects. The object-level cache partitioning framework has been implemented with a modified Linux kernel, and tested on a commodity multi-core processor. Experimental results show that in comparison with a standard L2 cache managed by LRU, Soft-OLP significantly reduces the execution time by reducing L2 cache misses across inputs for a set of single- and multi-threaded programs from the SPEC CPU2000 benchmark suite, NAS benchmarks and a computational kernel set.
AB - Performance degradation of memory-intensive programs caused by the LRU policy's inability to handle weak-locality data accesses in the last-level cache is increasingly serious for two reasons. First, the last-level cache remains in the CPU's critical path, where only simple management mechanisms, such as LRU, can be used, precluding some sophisticated hardware mechanisms to address the problem. Second, the commonly used shared cache structure of multi-core processors has made this critical path even more performance-sensitive due to intensive inter-thread contention for shared cache resources. Researchers have recently made efforts to address the problem with the LRU policy by partitioning the cache using hardware or OS facilities guided by run-time locality information. Such approaches often rely on special hardware support or lack enough accuracy. In contrast, for a large class of programs, the locality information can be accurately predicted if access patterns are recognized through small training runs at the data object level. To achieve this goal, we present a system-software framework referred to as Soft-OLP (Software-based Object-Level cache Partitioning). We first collect per-object reuse distance histograms and inter-object interference histograms via memory-trace sampling. With several low-cost training runs, we are able to determine the locality patterns of data objects. For the actual runs, we categorize data objects into different locality types and partition the cache space among data objects with a heuristic algorithm, in order to reduce cache misses through segregation of contending objects. The object-level cache partitioning framework has been implemented with a modified Linux kernel, and tested on a commodity multi-core processor. Experimental results show that in comparison with a standard L2 cache managed by LRU, Soft-OLP significantly reduces the execution time by reducing L2 cache misses across inputs for a set of single- and multi-threaded programs from the SPEC CPU2000 benchmark suite, NAS benchmarks and a computational kernel set.
KW - Cache partitioning
KW - Page coloring
KW - Reuse distance
KW - Software-controlled caching
UR - http://www.scopus.com/inward/record.url?scp=70449652924&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70449652924&partnerID=8YFLogxK
U2 - 10.1109/PACT.2009.35
DO - 10.1109/PACT.2009.35
M3 - Conference contribution
AN - SCOPUS:70449652924
SN - 9780769537719
T3 - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
SP - 246
EP - 257
BT - Proceedings - 2009 18th International Conference on Parallel Architectures and Compilation Techniques, PACT 2009
Y2 - 12 September 2009 through 16 September 2009
ER -