TY - JOUR
T1 - Versatile design of shared vector coprocessors for multicores
AU - Beldianu, Spiridon F.
AU - Dahlberg, Christopher
AU - Steele, Timothy
AU - Ziavras, Sotirios G.
N1 - Funding Information:
He received a National Science Foundation (NSF) Research Initiation Award in 1991, as well as many other grants from the NSF, Department of Energy, etc. He has served as an Associate Editor of the Pattern Recognition journal and other journals, and serves regularly as a member of Conference Program Committees. He has also served as the Program Chair for IEEE-sponsored Conferences. He has authored more than 160 published research papers.
PY - 2012/10
Y1 - 2012/10
N2 - For a wide range of applications that make use of a vector coprocessor, its resources are not highly utilized due to the lack of sustained data parallelism, which often occurs due to insufficient vector parallelism or vector-length variations in dynamic environments. The motivation of our work stems from (a) the omnipresence of vector operations in high-performance scientific and emerging embedded applications; (b) the mandate for multicore designs to make efficient use of on-chip resources for low power and high performance; (c) the frequent need to handle a variety of vector sizes; and (d) the diverse computation needs of vector kernels in application suites. Our objective is to provide a versatile design framework that can facilitate vector coprocessor sharing among multiple cores in a manner that maximizes resource utilization while also yielding very high performance at reduced area and energy costs. We have previously proposed three basic shared vector coprocessor architectures based on coarse-grain temporal, fine-grain temporal, and vector lane sharing that were implemented in SystemVerilog [15]. Our new paper presents substantially improved versions of these architectures that are implemented in synthesized RTL for higher accuracy. We herein evaluate these vector coprocessor sharing policies for a dual-core system using floating-point performance, resource utilization, and power consumption as metrics. Benchmarking for FIR filtering, FFT, matrix multiplication, LU decomposition, and sparse matrix-vector multiplication shows that these coprocessor sharing policies yield high utilization, high performance, and low energy per operation. Fine-grain temporal sharing most often provides the best performance among the three policies; it is followed by vector lane sharing and then coarse-grain temporal sharing. It is also shown that per-core exclusive access to the vector resources does not maximize their utilization. This benchmarking involves various scenarios for each application, where the scenarios differ in terms of the vector length and the parallelism-oriented coding technique.
KW - Coprocessor sharing
KW - FPGA prototyping
KW - Multicore
KW - Vector coprocessor
UR - http://www.scopus.com/inward/record.url?scp=84865849653&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84865849653&partnerID=8YFLogxK
U2 - 10.1016/j.micpro.2012.05.004
DO - 10.1016/j.micpro.2012.05.004
M3 - Article
AN - SCOPUS:84865849653
SN - 0141-9331
VL - 36
SP - 543
EP - 554
JO - Microprocessors and Microsystems
JF - Microprocessors and Microsystems
IS - 7
ER -