TY - JOUR
T1 - LAGC
T2 - Lazily Aggregated Gradient Coding for Straggler-Tolerant and Communication-Efficient Distributed Learning
AU - Zhang, Jingjing
AU - Simeone, Osvaldo
N1 - Funding Information:
Manuscript received May 22, 2019; revised December 9, 2019; accepted March 2, 2020. Date of publication April 3, 2020; date of current version March 1, 2021. This work was supported by the European Research Council through the European Union’s Horizon 2020 Research and Innovation Program under Grant 725731. (Corresponding author: Jingjing Zhang.) The authors are with the Department of Informatics, King’s College London, London WC2R 2LS, U.K. (e-mail: jingjing.1.zhang@kcl.ac.uk; osvaldo.simeone@kcl.ac.uk).
Publisher Copyright:
© 2020 IEEE.
PY - 2021/3
Y1 - 2021/3
N2 - Gradient-based distributed learning in parameter server (PS) computing architectures is subject to random delays due to straggling worker nodes and to possible communication bottlenecks between PS and workers. Solutions have been recently proposed to separately address these impairments based on the ideas of gradient coding (GC), worker grouping, and adaptive worker selection. This article provides a unified analysis of these techniques in terms of wall-clock time, communication, and computation complexity measures. Furthermore, in order to combine the benefits of GC and grouping in terms of robustness to stragglers with the communication and computation load gains of adaptive selection, novel strategies, named lazily aggregated GC (LAGC) and grouped-LAG (G-LAG), are introduced. Analysis and results show that G-LAG provides the best wall-clock time and communication performance while maintaining a low computational cost, for two representative distributions of the computing times of the worker nodes.
AB - Gradient-based distributed learning in parameter server (PS) computing architectures is subject to random delays due to straggling worker nodes and to possible communication bottlenecks between PS and workers. Solutions have been recently proposed to separately address these impairments based on the ideas of gradient coding (GC), worker grouping, and adaptive worker selection. This article provides a unified analysis of these techniques in terms of wall-clock time, communication, and computation complexity measures. Furthermore, in order to combine the benefits of GC and grouping in terms of robustness to stragglers with the communication and computation load gains of adaptive selection, novel strategies, named lazily aggregated GC (LAGC) and grouped-LAG (G-LAG), are introduced. Analysis and results show that G-LAG provides the best wall-clock time and communication performance while maintaining a low computational cost, for two representative distributions of the computing times of the worker nodes.
KW - Adaptive selection
KW - coding
KW - distributed learning
KW - gradient descent (GD)
KW - grouping
UR - http://www.scopus.com/inward/record.url?scp=85102292227&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85102292227&partnerID=8YFLogxK
U2 - 10.1109/TNNLS.2020.2979762
DO - 10.1109/TNNLS.2020.2979762
M3 - Article
C2 - 32287013
AN - SCOPUS:85102292227
SN - 2162-237X
VL - 32
SP - 962
EP - 974
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
IS - 3
M1 - 9056809
ER -