Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis

Hongkang Li, Meng Wang, Shuai Zhang, Sijia Liu, Pin-Yu Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Efficient training and inference algorithms, such as low-rank adaptation and model pruning, have shown impressive performance in learning Transformer-based large foundation models. However, due to the technical challenges of the non-convex optimization caused by the complicated architecture of Transformers, a theoretical understanding of why these methods can be applied to learn Transformers remains mostly elusive. To the best of our knowledge, this paper presents the first theoretical analysis of the low-rank and sparsity properties of one-layer Transformers by characterizing the trained model after convergence of stochastic gradient descent. Focusing on a data model based on label-relevant and label-irrelevant patterns, we show that the gradient updates of the trainable parameters are low-rank, with rank determined by the number of label-relevant patterns. We also analyze how model pruning affects generalization while improving computational efficiency and conclude that proper magnitude-based pruning has only a slight effect on testing performance. We conduct numerical experiments to support our findings.
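The two properties the abstract describes can be illustrated numerically. Below is a minimal Python sketch (not the paper's construction; the pattern model, dimensions, and sparsity level are illustrative assumptions): an SGD-style weight update aggregated from a few label-relevant pattern directions is low-rank, and magnitude-based pruning of such a matrix perturbs it only slightly.

```python
# Illustrative sketch of the abstract's two claims, under assumed toy settings:
# (1) a weight update built from a few label-relevant pattern directions is
#     low-rank, with rank bounded by the number of such patterns;
# (2) magnitude-based pruning of the resulting matrix has a small effect.
import numpy as np

rng = np.random.default_rng(0)
d, num_relevant = 64, 3  # assumed embedding dim and # of label-relevant patterns

# (1) Update formed as a sum of outer products over label-relevant patterns,
# mimicking gradient updates that aggregate a few pattern directions.
patterns = rng.standard_normal((num_relevant, d))
coeffs = rng.standard_normal((num_relevant, d))
delta_W = sum(np.outer(p, c) for p, c in zip(patterns, coeffs))

print("rank(delta_W) =", np.linalg.matrix_rank(delta_W),
      f"(bounded by {num_relevant} label-relevant patterns)")

def magnitude_prune(W, sparsity=0.5):
    """Zero out the smallest-magnitude entries of W (magnitude-based pruning)."""
    thresh = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) >= thresh, W, 0.0)

# (2) Prune a noisy version of the update and measure the relative change.
W = delta_W + 0.01 * rng.standard_normal((d, d))
W_pruned = magnitude_prune(W, sparsity=0.5)
rel_err = np.linalg.norm(W - W_pruned) / np.linalg.norm(W)
print(f"relative change after pruning 50% of entries: {rel_err:.3f}")
```

Because the update concentrates its magnitude along a few pattern directions, the small entries removed by pruning carry little of the matrix's norm, which is the intuition behind the abstract's claim that proper magnitude-based pruning barely affects testing performance.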

Original language: English (US)
Title of host publication: 2024 IEEE 13th Sensor Array and Multichannel Signal Processing Workshop, SAM 2024
Publisher: IEEE Computer Society
ISBN (Electronic): 9798350344813
DOIs
State: Published - 2024
Externally published: Yes
Event: 13th IEEE Sensor Array and Multichannel Signal Processing Workshop, SAM 2024 - Corvallis, United States
Duration: Jul 8, 2024 to Jul 11, 2024

Publication series

Name: Proceedings of the IEEE Sensor Array and Multichannel Signal Processing Workshop
ISSN (Electronic): 2151-870X

Conference

Conference: 13th IEEE Sensor Array and Multichannel Signal Processing Workshop, SAM 2024
Country/Territory: United States
City: Corvallis
Period: 7/8/24 to 7/11/24

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Control and Systems Engineering
  • Electrical and Electronic Engineering

Keywords

  • low-rank adaptation
  • mechanism
  • model pruning
  • Transformer
