TY - GEN
T1 - Opto-ViT
T2 - 44th IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2025
AU - Morsali, Mehrdad
AU - Zhou, Chengwei
AU - Najafi, Deniz
AU - Sarkar, Sreetama
AU - Mercati, Pietro
AU - Khoshavi, Navid
AU - Beerel, Peter
AU - Nikdast, Mahdi
AU - Datta, Gourav
AU - Angizi, Shaahin
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Vision Transformers (ViTs) have emerged as a powerful architecture for computer vision tasks due to their ability to model long-range dependencies and global contextual relationships. However, their substantial compute and memory demands hinder efficient deployment in scenarios with strict energy and bandwidth limitations. In this work, we propose Opto-ViT, the first near-sensor, region-aware ViT accelerator leveraging silicon photonics (SiPh) for real-time and energy-efficient vision processing. Opto-ViT features a hybrid electronic-photonic architecture, where the optical core handles compute-intensive matrix multiplications using Vertical-Cavity Surface-Emitting Lasers (VCSELs) and Microring Resonators (MRs), while nonlinear functions and normalization are executed electronically. To reduce redundant computation and patch processing, we introduce a lightweight Mask Generation Network (MGNet) that identifies regions of interest in the current frame and prunes irrelevant patches before ViT encoding. We further co-optimize the ViT backbone using quantization-aware training and matrix decomposition tailored for photonic constraints. Experiments across device fabrication, circuit and architecture co-design, to classification, detection, and video tasks demonstrate that Opto-ViT achieves 100.4 KFPS/W with up to 84% energy savings with less than 1.6% accuracy loss, while enabling scalable and efficient ViT deployment at the edge.
AB - Vision Transformers (ViTs) have emerged as a powerful architecture for computer vision tasks due to their ability to model long-range dependencies and global contextual relationships. However, their substantial compute and memory demands hinder efficient deployment in scenarios with strict energy and bandwidth limitations. In this work, we propose Opto-ViT, the first near-sensor, region-aware ViT accelerator leveraging silicon photonics (SiPh) for real-time and energy-efficient vision processing. Opto-ViT features a hybrid electronic-photonic architecture, where the optical core handles compute-intensive matrix multiplications using Vertical-Cavity Surface-Emitting Lasers (VCSELs) and Microring Resonators (MRs), while nonlinear functions and normalization are executed electronically. To reduce redundant computation and patch processing, we introduce a lightweight Mask Generation Network (MGNet) that identifies regions of interest in the current frame and prunes irrelevant patches before ViT encoding. We further co-optimize the ViT backbone using quantization-aware training and matrix decomposition tailored for photonic constraints. Experiments across device fabrication, circuit and architecture co-design, to classification, detection, and video tasks demonstrate that Opto-ViT achieves 100.4 KFPS/W with up to 84% energy savings with less than 1.6% accuracy loss, while enabling scalable and efficient ViT deployment at the edge.
UR - https://www.scopus.com/pages/publications/105029370212
UR - https://www.scopus.com/pages/publications/105029370212#tab=citedBy
U2 - 10.1109/ICCAD66269.2025.11240696
DO - 10.1109/ICCAD66269.2025.11240696
M3 - Conference contribution
AN - SCOPUS:105029370212
T3 - IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD
BT - 2025 IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2025 - Conference Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 26 October 2025 through 30 October 2025
ER -