This lecture was delivered via whiteboard and slides. A draft of the lecture is provided here. Further supporting discussion on parallelism more generally is given here.
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, J. Casper, Bryan Catanzaro. 2019.
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Z. Chen. NeurIPS 2019.
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, J. Bernauer, Bryan Catanzaro, Amar Phanishayee, M. Zaharia. SC 2021.
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models. Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, D. Song, I. Stoica. ICML 2021.