TY - GEN
T1 - Power-performance optimization of a virtualized SMT vector processor via thread fusion and lane configuration
AU - Lu, Yaojie
AU - Rooholamin, Seyedamin
AU - Ziavras, Sotirios G.
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/9/2
Y1 - 2016/9/2
N2 - Lane-based Vector Processors (VPs) are highly scalable. However, they become less energy efficient as they scale for vector applications having insufficient data-level parallelism (DLP) to keep the extra computation lanes fully occupied. We present a scalable and yet flexible VP that is capable of dynamically deactivating some of its computing lanes in order to reduce static power with minimum performance loss. In addition, our simultaneous multi-threaded (SMT) VP can exploit identical instruction flows that may be present in different vector applications by running in a novel fused mode that increases its utilization. We introduce a power model and two optimization policies for minimizing the consumed energy, or the product of the energy and runtime for a given application. Benchmarking that involves an FPGA prototype shows up to 33.8% energy reduction in addition to 40% runtime improvement, or up to 62.7% reduction in the product of energy and runtime.
KW - DLP
KW - dynamic configuration
KW - instruction fusion
KW - performance and energy optimization
KW - vector processor
UR - http://www.scopus.com/inward/record.url?scp=84988984785&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84988984785&partnerID=8YFLogxK
U2 - 10.1109/ISVLSI.2016.27
DO - 10.1109/ISVLSI.2016.27
M3 - Conference contribution
AN - SCOPUS:84988984785
T3 - Proceedings of IEEE Computer Society Annual Symposium on VLSI, ISVLSI
SP - 81
EP - 86
BT - Proceedings - IEEE Computer Society Annual Symposium on VLSI, ISVLSI 2016
PB - IEEE Computer Society
T2 - 15th IEEE Computer Society Annual Symposium on VLSI, ISVLSI 2016
Y2 - 11 July 2016 through 13 July 2016
ER -