To capitalize on multicore power, modern high-speed data transfer applications usually adopt multi-threaded design and aggregate multiple network interfaces. However, NUMA introduces another dimension of complexity to these applications. In this paper, we undertook comprehensive experiment on real systems to illustrate the importance of NUMA-awareness to applications with intensive memory accesses and network I/Os. Instead of simply attributing the NUMA effect to the physical layout, we provide an in-depth analysis of underlying interactions inside hardware devices. We profile the system performance by monitoring relevant hardware counters, and reveal how the NUMA penalty occurs during prefetch and cache synchronization processes. Consequently, we implement a thread mapping module in a bulk data transfer software, BBCP, as a practical example of enabling NUMA-awareness. The enhanced application is then evaluated on our high-performance testbed with storage area networks (SAN). Our experimental results show that the proposed NUMA optimizations can significantly improve BBCP's performance in memory-based tests with various contention levels and realistic data transfers involving SAN-based storage.
All Science Journal Classification (ASJC) codes
- Hardware and Architecture
- Computer Networks and Communications
- High-performance data transfer
- Multicore system
- Non-uniform memory access (NUMA) awareness