TY - GEN
T1 - EaseMiss: HW/SW Co-Optimization for Efficient Large Matrix-Matrix Multiplication Operations
T2 - 15th IEEE Dallas Circuits and Systems Conference, DCAS 2022
AU - Nezhadi, Ali
AU - Angizi, Shaahin
AU - Roohi, Arman
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Due to the essential role of matrix multiplication in many scientific applications, especially in data- and compute-intensive applications, we explore the efficiency of widely used matrix multiplication algorithms. This paper proposes an HW/SW co-optimization technique, entitled EaseMiss, to reduce the cache miss ratio of large general matrix-matrix multiplications. First, we revise the algorithms by applying three software optimization techniques to improve performance, and we examine and formulate how to choose the proper algorithm for the best performance. By leveraging the proposed optimizations, the number of cache misses in a conventional data cache decreases by a factor of 3. To further improve performance, we then propose SPLiTCACHE, which virtually splits the data cache according to the matrices' dimensions for better data reuse. This method can be easily embedded into conventional general-purpose processors or GPUs at the cost of negligible logic circuit overhead. With a correct and valid splitting, the obtained results show that cache misses are reduced by a factor of 2 on average compared to a conventional data cache in machine learning workloads.
AB - Due to the essential role of matrix multiplication in many scientific applications, especially in data- and compute-intensive applications, we explore the efficiency of widely used matrix multiplication algorithms. This paper proposes an HW/SW co-optimization technique, entitled EaseMiss, to reduce the cache miss ratio of large general matrix-matrix multiplications. First, we revise the algorithms by applying three software optimization techniques to improve performance, and we examine and formulate how to choose the proper algorithm for the best performance. By leveraging the proposed optimizations, the number of cache misses in a conventional data cache decreases by a factor of 3. To further improve performance, we then propose SPLiTCACHE, which virtually splits the data cache according to the matrices' dimensions for better data reuse. This method can be easily embedded into conventional general-purpose processors or GPUs at the cost of negligible logic circuit overhead. With a correct and valid splitting, the obtained results show that cache misses are reduced by a factor of 2 on average compared to a conventional data cache in machine learning workloads.
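N1 - Illustrative sketch, not taken from the paper: loop tiling (cache blocking) is one standard software technique of the kind the abstract alludes to for reducing GEMM cache misses. The routine name gemm_tiled and the tile edge BS below are hypothetical; BS should be tuned so three BS x BS tiles fit in the target data cache. Keeping such tiles from evicting one another is, per the abstract, the motivation for SPLiTCACHE's virtual split of the data cache.
     #include <stddef.h>
     #define BS 64  /* hypothetical tile edge; tune so 3 tiles fit in the D-cache */
     /* C += A * B for n x n row-major matrices, processed tile by tile so each
        BS x BS tile of A, B, and C is reused while it is still cache-resident. */
     void gemm_tiled(size_t n, const double *A, const double *B, double *C)
     {
         for (size_t ii = 0; ii < n; ii += BS)
             for (size_t kk = 0; kk < n; kk += BS)
                 for (size_t jj = 0; jj < n; jj += BS)
                     /* multiply the (ii,kk) tile of A by the (kk,jj) tile of B,
                        accumulating into the (ii,jj) tile of C */
                     for (size_t i = ii; i < ii + BS && i < n; i++)
                         for (size_t k = kk; k < kk + BS && k < n; k++) {
                             const double a = A[i * n + k];  /* reused across j */
                             for (size_t j = jj; j < jj + BS && j < n; j++)
                                 C[i * n + j] += a * B[k * n + j];
                         }
     }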
UR - http://www.scopus.com/inward/record.url?scp=85137681716&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85137681716&partnerID=8YFLogxK
U2 - 10.1109/DCAS53974.2022.9845629
DO - 10.1109/DCAS53974.2022.9845629
M3 - Conference contribution
AN - SCOPUS:85137681716
T3 - Proceedings of the 2022 IEEE Dallas Circuits and Systems Conference, DCAS 2022
BT - Proceedings of the 2022 IEEE Dallas Circuits and Systems Conference, DCAS 2022
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 17 June 2022 through 19 June 2022
ER -